1

I have a very large collection that implements the generic IList<T> interface and contains tens of millions of elements, and I would like to process them in parallel using PLINQ. I noticed that the overhead of parallelism is quite significant because processing each individual element is very lightweight, so I am searching for ways to chunkify the processing by splitting the IList<T> into small segments. My goal is to have finally something like this:

IList<Person> source = GetAllPersons();

double averageAge = source
    .Segmentate(1000) // Hypothetical operator that segmentates the source
    .AsParallel()
    .Select(segment => segment.Select(person => (double)person.CalculateAge()).Sum())
    .Sum() / source.Count;

I could use the Batch operator from the MoreLinq library, or any of the answers from many related questions, but all of these solutions are allocating multiple small arrays (or lists), and are copying the source into these containers, and I don't want that. In my case I have the additional requirement of keeping the garbage collector idle as much as possible.

I noticed that the .NET has the ArraySegment type, that seems to fit perfectly with my requirements:

// Delimits a section of a one-dimensional array.
public readonly struct ArraySegment<T> : ICollection<T>, IEnumerable<T>,
    IEnumerable, IList<T>, IReadOnlyCollection<T>, IReadOnlyList<T>

I could use this type to implement the allocation-free Segmentate operator like this:

/// <summary>Segmentates the source array into sized segments.</summary>
public static IEnumerable<ArraySegment<T>> Segmentate<T>(this T[] source, int size)
{
    for (int offset = 0; offset < source.Length; offset += size)
    {
        yield return new ArraySegment<T>(source, offset,
            Math.Min(size, source.Length - offset));
    }
}

But I can't use this type because my source is an IList<T> and not an array. And copying it to an array is not really an option, because the source is mutated frequently. And creating new array copies all the time is against my requirements.

So I am searching for a ListSegment<T> type, but as far as I can see it doesn't exist in .NET. Do I have to implement it myself? And if so, how? Or is any other way to segmentate an IList<T> without causing allocations?


Clarification: My source collection is not a List<T>. It is a custom class that implements the IList<T> interface.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
  • 1
    You can create Spans of a list (https://stackoverflow.com/a/60514418/7565574) using CollectionMarshal.AsSpan(), which is the same as an ArraySegment. – ckuri Dec 10 '20 at 07:09
  • I'm getting 350 milliseconds for 150 000 000 records with something as simple as: double averageAge = source .AsParallel() .Average(p => (double)p.Age); Is this a fabricated example, or more precicesly is the question more around the concept of slicing things with no allocations? – Alexandru Clonțea Dec 10 '20 at 07:20
  • @AlexandruClonțea yeap, the example is fabricated. I added it just for demonstration of the concept. – Theodor Zoulias Dec 10 '20 at 07:23
  • @ckuri it seems that the `Span` type has [limitations](https://learn.microsoft.com/en-us/dotnet/api/system.span-1#remarks) that make it unsuitable for my use case. I changed the signature of my operator to `IEnumerable> Segmentate(...`, and I am getting a compile time error: *CS0306 The type 'Span' may not be used as a type argument.* – Theodor Zoulias Dec 10 '20 at 07:46
  • This is a good read that might interest you: http://download.microsoft.com/download/B/C/F/BCFD4868-1354-45E3-B71B-B851CD78733D/WhenToUseParallelForEachOrPLINQ.pdf It seems to me that you want to sit somewhere in between these two worlds in this scenario you have described (PLINQ and Parallel). I was thinking of one solution that would yield parts of the collection, but that seems like Parallel with extra steps to me. – Alexandru Clonțea Dec 10 '20 at 07:48
  • 1
    @AlexandruClonțea thanks for the link, I'll read the linked document. Yes, I want the power of the `Parallel` combined with the beauty and convenience of the PLINQ. From my experience, using the most powerful overloads of the `Parallel.ForEach` can become [very complex and ugly](https://stackoverflow.com/questions/65204040/converting-from-a-foreach-loop-to-a-parallel-foreach-loop-when-summarizing-into/65205661#65205661)... – Theodor Zoulias Dec 10 '20 at 07:58
  • @AlexandruClonțea btw your PC must be 10 times faster than mine, because averaging 150,000,000 double precision numbers with PLINQ takes 3 seconds in my PC. :-) – Theodor Zoulias Dec 10 '20 at 12:20
  • i9-9900K xD That was not the point. I think it's reasonable to get the result in <15 seconds, but it depends on the use case. One point was that rule #1 of parallelization is "don't". Point #2: I don't like reinventing wheels unless I want to learn about wheelmaking. For sure, it's an interesting domain to dwell into, but in all cases where I've needed very complex parallelization, I've ended up with solutions that are "tied" to the specific problem rather than something generally applicable. I like the idea of the operation itself though, but there must be a reason it's missing from the sdk. – Alexandru Clonțea Dec 10 '20 at 12:46
  • Side note, for the specific "synthetic" use case of averaging, I'd probably look for a way to do it "on the go" or make a predictive model for it rather than always doing the expensive thing, especially if we go from 150 000 000 to something orders of magnitude larger :) – Alexandru Clonțea Dec 10 '20 at 12:49
  • 1
    @AlexandruClonțea I agree with all of your points. The `averageAge` example is probably a bit silly, but this is what came in my mind when I was searching for an example for this question. :-) – Theodor Zoulias Dec 10 '20 at 13:11

2 Answers2

4

You need to implement an ArraySegment<T> equivalent for IList<T>. See implementation below. For optimal performance, consider using spans instead.

ListSegment<T> Struct

public readonly struct ListSegment<T> : IList<T>
{
    public List<T> Items { get; }
    public int Offset { get; }
    public int Count { get; }

    public ListSegment(List<T> items, int offset, int count)
    {
        Items = items ?? throw new ArgumentNullException(nameof(items));
        Offset = offset;
        Count = count;

        if (items.Count < offset + count)
        {
            throw new ArgumentException("List segment out of range.", nameof(count));
        }
    }

    public void CopyTo(T[] array, int index)
    {
        if (Count > 0)
        {
            Items.CopyTo(Offset, array, index, Count);
        }
    }

    public bool Contains(T item) => IndexOf(item) != -1;

    public int IndexOf(T item)
    {
        for (var i = Offset; i < Offset + Count; i++)
        {
            if (Items[i].Equals(item))
            {
                return i;
            }
        }

        return -1;
    }

    private T ElementAt(int index)
    {
        if (Count > 0)
        {
            return Items[Offset + index];
        }

        throw new ArgumentOutOfRangeException(nameof(index));
    }

    public ListSegmentEnumerator GetEnumerator() => new ListSegmentEnumerator(this);

    #region IEnumerable<T> interface
    IEnumerator<T> IEnumerable<T>.GetEnumerator() => GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    #endregion

    #region ICollection<T> interface
    bool ICollection<T>.IsReadOnly => true;

    void ICollection<T>.Add(T item) => throw new NotImplementedException();
    bool ICollection<T>.Remove(T item) => throw new NotImplementedException();
    void ICollection<T>.Clear() => throw new NotImplementedException();
    #endregion

    #region IList<T> interface
    void IList<T>.Insert(int index, T item) => throw new NotImplementedException();
    void IList<T>.RemoveAt(int index) => throw new NotImplementedException();
    T IList<T>.this[int index]
    {
        get => ElementAt(index);
        set => throw new NotImplementedException();
    }
    #endregion

    public struct ListSegmentEnumerator : IEnumerator<T>
    {
        private readonly List<T> items;
        private readonly int start;
        private readonly int end;
        private int current;

        public ListSegmentEnumerator(ListSegment<T> segment)
        {
            items = segment.Items;
            start = segment.Offset;
            end = start + segment.Count;
            current = start - 1;
        }

        public bool MoveNext()
        {
            if (current < end)
            {
                current++;

                return current < end;
            }
            return false;
        }

        public T Current => items[current];
        object IEnumerator.Current => Current;

        void IEnumerator.Reset() => current = start - 1;
        void IDisposable.Dispose() { }
    }
}
l33t
  • 18,692
  • 16
  • 103
  • 180
  • Thanks l33t for the answer! It seems that this class is (close to) what I am looking for. But shouldn't the `public List Items` be defined as `public IList Items`? My source is not actually a `List`. – Theodor Zoulias Dec 10 '20 at 10:19
  • 1
    You can change it to `IList`. I'm using the concrete `List` here to enforce strong typing. The concrete class will perform better in certain situations (e.g. during enumeration, as explained [here](https://stackoverflow.com/a/62663221/419761)). – l33t Dec 10 '20 at 11:09
  • Yeap, I am experimenting with your class and I am realizing that things are getting faster when working with concrete classes instead of interfaces. I can see that implementing a concrete enumerator is also beneficial, compared to an iterator that yields. Unfortunately it is not practical for me to have a different `ListSegment`-like class for every `IList` implementation that I have to work with. My intention is to use it with PLINQ, which is already working with interfaces, so I'll have to accept the performance hit (as the cost of the convenience). – Theodor Zoulias Dec 10 '20 at 12:15
  • 1
    This `ListSegment` struct is not using the enumerator of the `items` variable. You can safely use `IList` instead. However, if you iterate over the public `Items` property you will indeed see a small performance hit. – l33t Dec 10 '20 at 12:46
-1

This answer assumes that your concrete IList, is of type List. You can use the GetRange function, which pretty much does what you want:

A shallow copy of a collection of reference types, or a subset of that collection, contains only the references to the elements of the collection. The objects themselves are not copied. The references in the new list point to the same objects as the references in the original list.

Even the ArraySegment<T> will create some kind of reference object to store the array segment, so you can't completely avoid that.

If you want to avoid storing the references (aka shallow copy), then a Span would be in order, but your comment that your collection changes conflicts with this

Items should not be added or removed from the List while the Span is in use.

So, your only other solution, would be to create one yourself as you mentioned.

Warning: There is a reason why such a thing does not exist built in. The array is fixed size, so getting a segment is much safer to handle. Be careful of unexpected consequences and side-effects when creating such a construct. This is the reason why the Span warns you that it's unsafe. You only know your requirements and how your data changes, so your collection wrapper should take them into account and handle them accordingly.

Athanasios Kataras
  • 25,191
  • 4
  • 32
  • 61
  • 1
    Thanks Athanasios for the answer, but the `List.GetRange` allocates a new `List`, and this is exactly what I am trying to avoid! – Theodor Zoulias Dec 10 '20 at 07:12
  • So to make sure I understand, you just don't want to maintain the space it takes to store the references to the same objects? – Athanasios Kataras Dec 10 '20 at 07:16
  • 1
    Let's say that my source has 50,000,000 elements, and I want to process them in segments of 1,000 elements each. I want to avoid allocating 50,000 throw-away lists, one for each segment, just to do this calculation. I am OK with allocating a single object, the `IList` enumerator. – Theodor Zoulias Dec 10 '20 at 07:21
  • Yeap, got it. The `GetRange` will not do the trick for you. Check the updated answer for details. – Athanasios Kataras Dec 10 '20 at 07:22
  • Athanasios just to clarify, my `IList` source is not mutated *during* the parallel processing, because it is not thread-safe to begin with. It is mutated *between* one parallel processing and the next. – Theodor Zoulias Dec 10 '20 at 07:28
  • If every range is done for and none is used when you change them, the `Span` implementation can work for you, The only downside is that it does not implement `IEnumerable` so you'll have to iterate it the conventional way. – Athanasios Kataras Dec 10 '20 at 07:32
  • By the way, it's a very interesting pointer reference implementation: https://source.dot.net/#System.Private.CoreLib/Span.cs,d2517139cac388e8 – Athanasios Kataras Dec 10 '20 at 07:33
  • Athanasios I had no luck with the `Span` type. See my answer to ckuri's comment under the question. – Theodor Zoulias Dec 10 '20 at 07:53
  • 1
    So it comes down to creating your own implementation (with the considerations above), or changing the `yield return` and `IEnumerable` implementation and passing a size and an index to your function, to return the appropriate segment. I would go with the second solution. – Athanasios Kataras Dec 10 '20 at 07:59
  • That's why I mentioned you'll have to iterate in the conventional way. – Athanasios Kataras Dec 10 '20 at 08:00
  • Confusing... `IList` != `IList` and `IEnumerable` != `IEnumerable`. So which ones are you referring to? (The non-generic ones will box each element during enumeration, which can be extremely costly.) – l33t Dec 10 '20 at 09:53