Create batches in LINQ

Question 1

Can someone suggest a way to create batches of a certain size in LINQ?

Ideally I want to be able to perform operations in chunks of some configurable size.

Question 2

You don't need to write any code. Use the MoreLINQ Batch method, which batches the source sequence into buckets of a given size (MoreLINQ is available as a NuGet package you can install):

int size = 10;
var batches = sequence.Batch(size);

It is implemented as:

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
                  this IEnumerable<TSource> source, int size)
{
    TSource[] bucket = null;
    var count = 0;

    foreach (var item in source)
    {
        if (bucket == null)
            bucket = new TSource[size];

        bucket[count++] = item;
        if (count != size)
            continue;

        yield return bucket;

        bucket = null;
        count = 0;
    }

    if (bucket != null && count > 0)
        yield return bucket.Take(count).ToArray();
}
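
For comparison, .NET 6 and later ship a built-in Enumerable.Chunk that likewise buffers each batch into an array, so on those runtimes an extra package may not be needed. A minimal sketch, assuming .NET 6 or later; ChunkDemo and BatchSums are illustrative names, not part of the answer above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkDemo
{
    // Assumption: the project targets .NET 6 or later, where Enumerable.Chunk exists.
    // Sums each batch of `size` items; the last batch may be shorter.
    public static List<int> BatchSums(IEnumerable<int> source, int size)
    {
        return source.Chunk(size)                    // IEnumerable<int[]>
                     .Select(chunk => chunk.Sum())
                     .ToList();
    }
}
```

Calling ChunkDemo.BatchSums(Enumerable.Range(0, 10), 3) yields four sums, one per batch, with the last batch holding only the item 9.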

Question 3

public static class MyExtensions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items,
                                                       int maxItems)
    {
        return items.Select((item, inx) => new { item, inx })
                    .GroupBy(x => x.inx / maxItems)
                    .Select(g => g.Select(x => x.item));
    }
}

and the usage would be:

List<int> list = new List<int>() { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

foreach(var batch in list.Batch(3))
{
    Console.WriteLine(String.Join(",",batch));
}

OUTPUT:

0,1,2
3,4,5
6,7,8
9

Question 4

If you start with sequence defined as an IEnumerable<T>, and you know that it can safely be enumerated multiple times (e.g. because it is an array or a list), you can just use this simple pattern to process the elements in batches:

while (sequence.Any())
{
    var batch = sequence.Take(10);
    sequence = sequence.Skip(10);

    // do whatever you need to do with each batch here
}
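
A self-contained sketch of that pattern, assuming the source is an in-memory collection that is cheap to enumerate repeatedly; SkipTakeDemo and ProcessInBatches are illustrative names. On a general IEnumerable<T>, each chained Skip may walk the source from the beginning again, so this is better suited to small lists than to deferred queries:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipTakeDemo
{
    // Joins each batch into a comma-separated line so the result is easy to inspect.
    public static List<string> ProcessInBatches(IEnumerable<int> sequence, int batchSize)
    {
        var lines = new List<string>();
        while (sequence.Any())
        {
            var batch = sequence.Take(batchSize);
            lines.Add(string.Join(",", batch));    // enumerate the batch here
            sequence = sequence.Skip(batchSize);   // deferred: nothing is copied
        }
        return lines;
    }
}
```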

Question 5

All of the above perform terribly with large batches or low memory. I had to write my own version that pipelines (note that no items are accumulated anywhere):

public static class BatchLinq {
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size) {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeIEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size) {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }
}

Edit: A known issue with this approach is that each batch must be enumerated, and enumerated fully, before moving on to the next batch. For example, this doesn't work:

//Select first item of every 100 items
Batch(list, 100).Select(b => b.First())
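
To make the failure mode concrete, the following sketch copies the Batch method from this answer and compares partial with full consumption of each batch. StreamingBatch, PartialVsFullDemo and the two demo method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreamingBatch
{
    // Same logic as the Batch method above, repeated so the sketch compiles on its own.
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeIEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size)
    {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }
}

public static class PartialVsFullDemo
{
    // Reads only the first item of each batch. The shared enumerator then advances
    // by just one item per batch, so every source item starts its own "batch".
    public static List<int> FirstOnly(IEnumerable<int> source, int size) =>
        source.Batch(size).Select(b => b.First()).ToList();

    // Drains each batch before asking for the next one, as this approach requires.
    public static List<int> FirstAfterDraining(IEnumerable<int> source, int size) =>
        source.Batch(size).Select(b => b.ToList().First()).ToList();
}
```

With the ten numbers 0 to 9 and a batch size of 3, FirstOnly returns all ten numbers, while FirstAfterDraining returns 0, 3, 6, 9.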

Question 6

This is a fully lazy, low-overhead, single-function implementation of Batch that doesn't do any accumulation. It is based on (and fixes issues in) Nick Whaley's solution, with help from EricRoller.

Iteration comes directly from the underlying IEnumerable, so elements must be enumerated in strict order and accessed no more than once. If some elements aren't consumed in an inner loop, they are discarded (and trying to access them again via a saved iterator will throw InvalidOperationException: Enumeration already finished.).

You can test a complete sample at .NET Fiddle.

public static class BatchLinq
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
            {
                int i = 0;
                // Batch is a local function closing over `i` and `enumerator` that
                // executes the inner batch enumeration
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }

                yield return Batch();
                while (++i < size && enumerator.MoveNext()); // discard skipped items
            }
    }
}
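
A usage sketch for this version, showing that reading only the first item of each batch now skips the rest of that batch instead of leaking items into the next one. The Batch body repeats the code above so the block stands alone; LazyBatchDemo, FirstOfEach and SumOfEach are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyBatchDemo
{
    // Same logic as the Batch method above; the local function is called Inner
    // here only to keep it visually distinct from the outer method.
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(size), "Must be greater than zero.");
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Inner()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }

                yield return Inner();
                while (++i < size && enumerator.MoveNext()) { } // discard skipped items
            }
    }

    // Partial consumption: the unread items of each batch are discarded.
    public static List<int> FirstOfEach(IEnumerable<int> source, int size) =>
        source.Batch(size).Select(b => b.First()).ToList();

    // Full consumption: every item is read exactly once.
    public static List<int> SumOfEach(IEnumerable<int> source, int size) =>
        source.Batch(size).Select(b => b.Sum()).ToList();
}
```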

Question 7

I wonder why nobody has ever posted an old-school for-loop solution. Here is one:

List<int> source = Enumerable.Range(1,23).ToList();
int batchsize = 10;
for (int i = 0; i < source.Count; i+= batchsize)
{
    var batch = source.Skip(i).Take(batchsize);
}

This simplicity is possible because the Take method:

... enumerates source and yields elements until count elements have been yielded or source contains no more elements. If count exceeds the number of elements in source, all elements of source are returned

Disclaimer:

Using Skip and Take inside the loop means that the enumerable will be enumerated multiple times. This is dangerous if the enumerable is deferred: it may result in multiple executions of a database query, a web request, or a file read. This example is explicitly meant for a List, which is not deferred, so it is less of a problem. It is still a slow solution, since Skip will enumerate the collection each time it is called.

This can also be solved using the GetRange method, but it requires an extra calculation to extract a possible remainder batch:

for (int i = 0; i < source.Count; i += batchsize)
{
    int remaining = source.Count - i;
    var batch = remaining > batchsize  ? source.GetRange(i, batchsize) : source.GetRange(i, remaining);
}
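
The remainder calculation can be folded into a single Math.Min call. A minimal sketch, assuming the source is a List<T>; GetRangeDemo and GetRangeBatches are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GetRangeDemo
{
    public static List<List<T>> GetRangeBatches<T>(List<T> source, int batchsize)
    {
        if (batchsize <= 0) throw new ArgumentOutOfRangeException(nameof(batchsize));

        var batches = new List<List<T>>();
        for (int i = 0; i < source.Count; i += batchsize)
        {
            // Never ask GetRange for more items than remain in the list.
            int count = Math.Min(batchsize, source.Count - i);
            batches.Add(source.GetRange(i, count));   // GetRange copies the items
        }
        return batches;
    }
}
```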

Here is a third way to handle this, which works with two loops. This ensures that the collection is enumerated only once:

int batchsize = 10;
List<int> batch = new List<int>(batchsize);

for (int i = 0; i < source.Count; i += batchsize)
{
    // calculate the remaining items to avoid an ArgumentOutOfRangeException
    batchsize = source.Count - i > batchsize ? batchsize : source.Count - i;
    for (int j = i; j < i + batchsize; j++)
    {
        batch.Add(source[j]);
    }
    // process the batch here, before it is cleared for reuse
    batch.Clear();
}

Question 8

Same approach as MoreLINQ, but using a List instead of an array. I haven't benchmarked it, but readability matters more to some people:

    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        List<T> batch = new List<T>();

        foreach (var item in source)
        {
            batch.Add(item);

            if (batch.Count >= size)
            {
                yield return batch;
                // hand out a fresh list so batches already yielded are not cleared
                batch = new List<T>();
            }
        }

        if (batch.Count > 0)
        {
            yield return batch;
        }
    }

Question 9

Here is an attempted improvement of Nick Whaley's (link) and infogulch's (link) lazy Batch implementations. This one is strict: you either enumerate the batches in the correct order, or you get an exception.

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
    this IEnumerable<TSource> source, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    using (var enumerator = source.GetEnumerator())
    {
        int i = 0;
        while (enumerator.MoveNext())
        {
            if (i % size != 0) throw new InvalidOperationException(
                "The enumeration is out of order.");
            i++;
            yield return GetBatch();
        }
        IEnumerable<TSource> GetBatch()
        {
            while (true)
            {
                yield return enumerator.Current;
                if (i % size == 0 || !enumerator.MoveNext()) break;
                i++;
            }
        }
    }
}

And here is a lazy Batch implementation for sources of type IList<T>. This one imposes no restrictions on the enumeration: the batches can be enumerated partially, in any order, and more than once. The restriction of not modifying the collection during the enumeration still applies, though. This is achieved by making a dummy call to enumerator.MoveNext() before yielding any chunk or element. The downside is that the enumerator is left undisposed, since it is unknown when the enumeration will finish.

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
    this IList<TSource> source, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    var enumerator = source.GetEnumerator();
    for (int i = 0; i < source.Count; i += size)
    {
        enumerator.MoveNext();
        yield return GetChunk(i, Math.Min(i + size, source.Count));
    }
    IEnumerable<TSource> GetChunk(int from, int toExclusive)
    {
        for (int j = from; j < toExclusive; j++)
        {
            enumerator.MoveNext();
            yield return source[j];
        }
    }
}

Question 10

So with a functional hat on, this appears trivial... but in C# there are some significant downsides.

You'd probably view this as an unfold of IEnumerable (google it and you'll probably end up in some Haskell docs, but there may be some F# material using unfold; if you know F#, squint at the Haskell docs and it will make sense).

Unfold is related to fold ("Aggregate"), except that rather than iterating through the input IEnumerable, it iterates through the output data structures (it's a similar relationship to the one between IEnumerable and IObservable; in fact I think IObservable does implement an "unfold" called Generate...).

Anyway, first you need an unfold method. I think this works (unfortunately it will eventually blow the stack for large "lists"... you can write this safely in F# using yield! rather than concat):

    static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        var maybeNewSeedAndElement = f(seed);

        return maybeNewSeedAndElement.SelectMany(x => new[] { x.Item2 }.Concat(Unfold(f, x.Item1)));
    }

This is a bit obtuse because C# doesn't implement some of the things functional languages take for granted... but it basically takes a seed and then generates a "Maybe" answer of the next element in the IEnumerable and the next seed (Maybe doesn't exist in C#, so we've used IEnumerable to fake it), and concatenates the rest of the answer (I can't vouch for the "O(n?)" complexity of this).

Once you've done that:

    static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return Unfold(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }

It all looks quite clean... you take the "n" elements as the "next" element in the IEnumerable, and the "tail" is the rest of the unprocessed list.

If there is nothing in the head... you're done... you return "Nothing" (faked as an empty IEnumerable)... otherwise you return the head element and the tail to process.

You can probably do this using IObservable; there's probably a "Batch"-like method already there, and you can probably use that.

If the risk of stack overflows worries you (it probably should), then you should implement this in F# (and there's probably some F# library (FSharpX?) that already has it).

(I have only done some rudimentary tests of this, so there may be odd bugs in there.)
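
If staying in C# is preferred, one way to sidestep the recursion is an iterator-based unfold. This is a sketch rather than a drop-in replacement: it assumes the step function returns at most one tuple (the "Maybe" case), and the chained Skip calls may still make each step progressively more expensive on sources that cannot skip by index. UnfoldSketch and UnfoldIterative are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnfoldSketch
{
    // Loops instead of recursing, so the call stack does not grow with the
    // number of generated elements. An empty result from `f` ends the sequence.
    public static IEnumerable<T> UnfoldIterative<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        while (true)
        {
            var next = f(seed).FirstOrDefault();   // null plays the role of "Nothing"
            if (next == null) yield break;
            yield return next.Item2;
            seed = next.Item1;
        }
    }

    public static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return UnfoldIterative<IEnumerable<T>, IEnumerable<T>>(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                // Emits one tuple if the head has at least one item, none otherwise.
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }
}
```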

Question 11

I'm joining this very late, but I found something more interesting.

We can use Skip and Take here for better performance.

public static class MyExtensions
    {
        public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items, int maxItems)
        {
            return items.Select((item, index) => new { item, index })
                        .GroupBy(x => x.index / maxItems)
                        .Select(g => g.Select(x => x.item));
        }

        public static IEnumerable<T> Batch2<T>(this IEnumerable<T> items, int skip, int take)
        {
            return items.Skip(skip).Take(take);
        }

    }

Next I checked this with 100000 records. Only the looping takes more time in the case of Batch.

Code of the console application:

static void Main(string[] args)
{
    List<string> Ids = GetData("First");
    List<string> Ids2 = GetData("tsriF");

    Stopwatch FirstWatch = new Stopwatch();
    FirstWatch.Start();
    foreach (var batch in Ids2.Batch(5000))
    {
        // Console.WriteLine("Batch Ouput:= " + string.Join(",", batch));
    }
    FirstWatch.Stop();
    Console.WriteLine("Done Processing time taken:= "+ FirstWatch.Elapsed.ToString());


    Stopwatch Second = new Stopwatch();

    Second.Start();
    int Length = Ids2.Count;
    int StartIndex = 0;
    int BatchSize = 5000;
    while (Length > 0)
    {
        var SecBatch = Ids2.Batch2(StartIndex, BatchSize);
        // Console.WriteLine("Second Batch Ouput:= " + string.Join(",", SecBatch));
        Length = Length - BatchSize;
        StartIndex += BatchSize;
    }

    Second.Stop();
    Console.WriteLine("Done Processing time taken Second:= " + Second.Elapsed.ToString());
    Console.ReadKey();
}

static List<string> GetData(string name)
{
    List<string> Data = new List<string>();
    for (int i = 0; i < 100000; i++)
    {
        Data.Add(string.Format("{0} {1}", name, i.ToString()));
    }

    return Data;
}

The time taken is as follows.

First - 00:00:00.0708, 00:00:00.0660

Second (Take and Skip one) - 00:00:00.0008, 00:00:00.0008
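
These numbers should be read with care. Skip and Take are deferred, and the second loop above never enumerates SecBatch, so it measures only the cost of building the query objects, while the first loop at least forces GroupBy to read the whole list. A sketch of a more comparable measurement, which counts the items of every batch in both variants; BenchmarkSketch and its method names are illustrative, and the two helpers are copied from the answer so the block compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class BenchmarkSketch
{
    static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> items, int maxItems) =>
        items.Select((item, index) => new { item, index })
             .GroupBy(x => x.index / maxItems)
             .Select(g => g.Select(x => x.item));

    static IEnumerable<T> Batch2<T>(IEnumerable<T> items, int skip, int take) =>
        items.Skip(skip).Take(take);

    public static int CountViaGroupBy(List<string> ids, int batchSize)
    {
        int total = 0;
        foreach (var batch in Batch(ids, batchSize))
            total += batch.Count();                             // force enumeration of each batch
        return total;
    }

    public static int CountViaSkipTake(List<string> ids, int batchSize)
    {
        int total = 0;
        for (int start = 0; start < ids.Count; start += batchSize)
            total += Batch2(ids, start, batchSize).Count();     // force enumeration
        return total;
    }

    // Elapsed time for a single run of `work`; single runs are noisy,
    // so repeat the measurement before drawing conclusions.
    public static TimeSpan Time(Action work)
    {
        var watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Both counting methods should report the same total, which confirms that each variant actually visited every record before the timings are compared.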

Question 12

I wrote a custom IEnumerable implementation that works without LINQ and guarantees a single enumeration over the data. It also accomplishes all this without requiring backing lists or arrays, which cause memory explosions over large data sets.

Here are some basic tests:

    [Fact]
    public void ShouldPartition()
    {
        var ints = new List<int> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        var data = ints.PartitionByMaxGroupSize(3);
        data.Count().Should().Be(4);

        data.Skip(0).First().Count().Should().Be(3);
        data.Skip(0).First().ToList()[0].Should().Be(0);
        data.Skip(0).First().ToList()[1].Should().Be(1);
        data.Skip(0).First().ToList()[2].Should().Be(2);

        data.Skip(1).First().Count().Should().Be(3);
        data.Skip(1).First().ToList()[0].Should().Be(3);
        data.Skip(1).First().ToList()[1].Should().Be(4);
        data.Skip(1).First().ToList()[2].Should().Be(5);

        data.Skip(2).First().Count().Should().Be(3);
        data.Skip(2).First().ToList()[0].Should().Be(6);
        data.Skip(2).First().ToList()[1].Should().Be(7);
        data.Skip(2).First().ToList()[2].Should().Be(8);

        data.Skip(3).First().Count().Should().Be(1);
        data.Skip(3).First().ToList()[0].Should().Be(9);
    }

The extension method to partition the data:

/// <summary>
/// A set of extension methods for <see cref="IEnumerable{T}"/>. 
/// </summary>
public static class EnumerableExtender
{
    /// <summary>
    /// Splits an enumerable into chunks, by a maximum group size.
    /// </summary>
    /// <param name="source">The source to split</param>
    /// <param name="maxSize">The maximum number of items per group.</param>
    /// <typeparam name="T">The type of item to split</typeparam>
    /// <returns>A list of lists of the original items.</returns>
    public static IEnumerable<IEnumerable<T>> PartitionByMaxGroupSize<T>(this IEnumerable<T> source, int maxSize)
    {
        return new SplittingEnumerable<T>(source, maxSize);
    }
}

This is the implementing class:

    using System.Collections;
    using System.Collections.Generic;

    internal class SplittingEnumerable<T> : IEnumerable<IEnumerable<T>>
    {
        private readonly IEnumerable<T> backing;
        private readonly int maxSize;
        private bool hasCurrent;
        private T lastItem;

        public SplittingEnumerable(IEnumerable<T> backing, int maxSize)
        {
            this.backing = backing;
            this.maxSize = maxSize;
        }

        public IEnumerator<IEnumerable<T>> GetEnumerator()
        {
            return new Enumerator(this, this.backing.GetEnumerator());
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.GetEnumerator();
        }

        private class Enumerator : IEnumerator<IEnumerable<T>>
        {
            private readonly SplittingEnumerable<T> parent;
            private readonly IEnumerator<T> backingEnumerator;
            private NextEnumerable current;

            public Enumerator(SplittingEnumerable<T> parent, IEnumerator<T> backingEnumerator)
            {
                this.parent = parent;
                this.backingEnumerator = backingEnumerator;
                this.parent.hasCurrent = this.backingEnumerator.MoveNext();
                if (this.parent.hasCurrent)
                {
                    this.parent.lastItem = this.backingEnumerator.Current;
                }
            }

            public bool MoveNext()
            {
                if (this.current == null)
                {
                    this.current = new NextEnumerable(this.parent, this.backingEnumerator);
                    return true;
                }
                else
                {
                    if (!this.current.IsComplete)
                    {
                        using (var enumerator = this.current.GetEnumerator())
                        {
                            while (enumerator.MoveNext())
                            {
                            }
                        }
                    }
                }

                if (!this.parent.hasCurrent)
                {
                    return false;
                }

                this.current = new NextEnumerable(this.parent, this.backingEnumerator);
                return true;
            }

            public void Reset()
            {
                throw new System.NotImplementedException();
            }

            public IEnumerable<T> Current
            {
                get { return this.current; }
            }

            object IEnumerator.Current
            {
                get { return this.Current; }
            }

            public void Dispose()
            {
            }
        }

        private class NextEnumerable : IEnumerable<T>
        {
            private readonly SplittingEnumerable<T> splitter;
            private readonly IEnumerator<T> backingEnumerator;
            private int currentSize;

            public NextEnumerable(SplittingEnumerable<T> splitter, IEnumerator<T> backingEnumerator)
            {
                this.splitter = splitter;
                this.backingEnumerator = backingEnumerator;
            }

            public bool IsComplete { get; private set; }

            public IEnumerator<T> GetEnumerator()
            {
                return new NextEnumerator(this.splitter, this, this.backingEnumerator);
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return this.GetEnumerator();
            }

            private class NextEnumerator : IEnumerator<T>
            {
                private readonly SplittingEnumerable<T> splitter;
                private readonly NextEnumerable parent;
                private readonly IEnumerator<T> enumerator;
                private T currentItem;

                public NextEnumerator(SplittingEnumerable<T> splitter, NextEnumerable parent, IEnumerator<T> enumerator)
                {
                    this.splitter = splitter;
                    this.parent = parent;
                    this.enumerator = enumerator;
                }

                public bool MoveNext()
                {
                    this.parent.currentSize += 1;
                    this.currentItem = this.splitter.lastItem;
                    var hasCcurent = this.splitter.hasCurrent;

                    this.parent.IsComplete = this.parent.currentSize > this.splitter.maxSize;

                    if (this.parent.IsComplete)
                    {
                        return false;
                    }

                    if (hasCcurent)
                    {
                        var result = this.enumerator.MoveNext();

                        this.splitter.lastItem = this.enumerator.Current;
                        this.splitter.hasCurrent = result;
                    }

                    return hasCcurent;
                }

                public void Reset()
                {
                    throw new System.NotImplementedException();
                }

                public T Current
                {
                    get { return this.currentItem; }
                }

                object IEnumerator.Current
                {
                    get { return this.Current; }
                }

                public void Dispose()
                {
                }
            }
        }
    }

Question 13

Yet another one-line implementation. It works even with an empty list; in that case you get an empty collection of batches.

var aList = Enumerable.Range(1, 100).ToList(); //a given list
var size = 9; //the wanted batch size
//number of batches are: (aList.Count() + size - 1) / size;

var batches = Enumerable.Range(0, (aList.Count() + size - 1) / size).Select(i => aList.GetRange( i * size, Math.Min(size, aList.Count() - i * size)));

Assert.True(batches.Count() == 12);
Assert.AreEqual(batches.ToList().ElementAt(0), new List<int>() { 1, 2, 3, 4, 5, 6, 7, 8, 9 });
Assert.AreEqual(batches.ToList().ElementAt(1), new List<int>() { 10, 11, 12, 13, 14, 15, 16, 17, 18 });
Assert.AreEqual(batches.ToList().ElementAt(11), new List<int>() { 100 });
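
The empty-list claim is easy to check. A sketch that wraps the same expression in a helper so it can be called with different inputs; RangeBatchDemo and RangeBatches are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeBatchDemo
{
    public static List<List<T>> RangeBatches<T>(List<T> aList, int size)
    {
        int batchCount = (aList.Count + size - 1) / size;   // ceiling division
        return Enumerable.Range(0, batchCount)
                         .Select(i => aList.GetRange(i * size, Math.Min(size, aList.Count - i * size)))
                         .ToList();
    }
}
```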

Question 14

I know everybody used complex systems to do this work, and I really don't get why. Take and Skip allow all of these operations using the common Select with the Func<TSource,Int32,TResult> transform function. Like this:

public IEnumerable<IEnumerable<T>> Buffer<T>(IEnumerable<T> source, int size)=>
    source.Select((item, index) => source.Skip(size * index).Take(size)).TakeWhile(bucket => bucket.Any());

Question 15

Another way is to use the Rx Buffer operator.

//using System.Linq;
//using System.Reactive.Linq;
//using System.Reactive.Threading.Tasks;

var observableBatches = anAnumerable.ToObservable().Buffer(size);

var batches = aList.ToObservable().Buffer(size).ToList().ToTask().GetAwaiter().GetResult();

Question 16

    static IEnumerable<IEnumerable<T>> TakeBatch<T>(IEnumerable<T> ts,int batchSize)
    {
        return from @group in ts.Select((x, i) => new { x, i }).ToLookup(xi => xi.i / batchSize)
               select @group.Select(xi => xi.x);
    }