
Given the following three lists:

    var FirstNames = new List<string>(){ "Bob", "Sondra", "Avery", "Von", "Randle", "Gwen", "Paisley" };
    var LastNames = new List<string>(){ "Anderson", "Carlson", "Vickers", "Black", "Schultz", "Marigold", "Johnson" };
    var Birthdates = new List<DateTime>()
                    { 
                        Convert.ToDateTime("11/12/1980"), 
                        Convert.ToDateTime("09/16/1978"), 
                        Convert.ToDateTime("05/18/1985"), 
                        Convert.ToDateTime("10/29/1980"), 
                        Convert.ToDateTime("01/19/1989"), 
                        Convert.ToDateTime("01/14/1972"), 
                        Convert.ToDateTime("02/20/1981") 
                    };

I'd like to combine them into a new generic type where the relationship the lists share is their position in the collection, i.e. `FirstNames[0]`, `LastNames[0]`, and `Birthdates[0]` belong to the same person.

So I have come up with this LINQ query, which matches on indices and seems to work fine for now:

    var students = from fn in FirstNames
                   from ln in LastNames
                   from bd in Birthdates
                   where FirstNames.IndexOf(fn) == LastNames.IndexOf(ln)
                   where FirstNames.IndexOf(fn) == Birthdates.IndexOf(bd)
                   select new { First = fn, Last = ln, Birthdate = bd.Date };

However, when I stress tested this code (each `List<string>` and the `List<DateTime>` loaded with a few million records), I ran into a `System.OutOfMemoryException`.
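For context on why this blows up: the query above is a cross join, not a positional join. In method syntax it desugars into nested `SelectMany` calls, so two lists of length n produce n² candidate rows before the `where` clauses filter them (three lists produce n³), and each `IndexOf` call is itself an O(n) scan. A minimal sketch with two small lists (values are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var first = new List<string> { "a", "b", "c" };
var last  = new List<string> { "x", "y", "z" };

long candidates = 0;
var pairs = first
    .SelectMany(f => last, (f, l) => (f, l))   // cross join: n * n combinations
    .Where(p =>
    {
        candidates++;                          // count every candidate examined
        return first.IndexOf(p.f) == last.IndexOf(p.l);
    })
    .ToList();

Console.WriteLine(candidates);   // 9 candidates examined for n = 3
Console.WriteLine(pairs.Count);  // 3 matched pairs survive
```

With three lists of a few million entries each, the number of candidate combinations becomes astronomical, which is why the query falls over long before it finishes.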

Is there any other way of writing this query to achieve the same results more efficiently using LINQ?

Fus Ro Dah

3 Answers


That is what Zip is for.

var result = FirstNames
  .Zip(LastNames, (f,l) => new {f,l})
  .Zip(BirthDates, (fl, b) => new {First=fl.f, Last = fl.l, BirthDate = b});

Regarding scaling:

int count = 50000000;
var FirstNames = Enumerable.Range(0, count).Select(x=>x.ToString());
var LastNames = Enumerable.Range(0, count).Select(x=>x.ToString());
var BirthDates = Enumerable.Range(0, count).Select(x=> DateTime.Now.AddSeconds(x));

var sw = new Stopwatch();
sw.Start();

var result = FirstNames
  .Zip(LastNames, (f,l) => new {f,l})
  .Zip(BirthDates, (fl, b) => new {First=fl.f, Last = fl.l, BirthDate = b});

foreach(var r in result)
{
    var x = r;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds); // Returns 69191 on my machine.

The following versions, by contrast, blow up with an out-of-memory error:

int count = 50000000;
var FirstNames = Enumerable.Range(0, count).Select(x=>x.ToString());
var LastNames = Enumerable.Range(0, count).Select(x=>x.ToString());
var BirthDates = Enumerable.Range(0, count).Select(x=> DateTime.Now.AddSeconds(x));

var sw = new Stopwatch();
sw.Start();

var FirstNamesList = FirstNames.ToList(); // Throws OutOfMemoryException in 32-bit .NET
var LastNamesList = LastNames.ToList();
var BirthDatesList = BirthDates.ToList();

var result = Enumerable.Range(0, FirstNamesList.Count)
    .Select(i => new 
                 { 
                     First = FirstNamesList[i], 
                     Last = LastNamesList[i], 
                     Birthdate = BirthDatesList[i] 
                 });

// The indexed Select alternative (note this reassignment means only this
// version is actually enumerated and timed below):
result = BirthDatesList.Select((bd, i) => new
{ 
    First = FirstNamesList[i], 
    Last = LastNamesList[i], 
    BirthDate = bd 
});

foreach(var r in result)
{
    var x = r;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

At lower counts, the cost of converting the enumerables to lists also outweighs the extra intermediate-object creation: Zip was approximately 30% faster than the indexed versions. As you add more columns, Zip's advantage would likely shrink.

The performance characteristics are also very different. The Zip version starts producing results almost immediately, while the others produce nothing until the entire enumerables have been read and converted to lists. So if you paginate the results with `.Skip(x).Take(y)`, or check whether something exists with `.Any(...)`, Zip will be orders of magnitude faster, because it doesn't have to convert the entire enumerable first.
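A small sketch of that streaming behaviour: because `Zip` is lazy, `Any()` can short-circuit after pulling a single element from each source, even when the sources are far too large to materialize as lists (the counter here is only instrumentation for the demo):

```csharp
using System;
using System.Linq;

int evaluated = 0;
var firstNames = Enumerable.Range(0, 50_000_000)
    .Select(x => { evaluated++; return x.ToString(); });  // counts elements actually produced
var lastNames = Enumerable.Range(0, 50_000_000)
    .Select(x => x.ToString());

bool found = firstNames
    .Zip(lastNames, (f, l) => new { First = f, Last = l })
    .Any(p => p.First == "0");   // matches the very first element and stops

Console.WriteLine(found);       // True
Console.WriteLine(evaluated);   // 1 -- only one element was ever produced
```

A list-based version would have to materialize all 50 million elements before the `Any` check could even begin.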

Lastly, if it becomes performance critical and you need to combine many sources, you could extend `Zip` to handle three enumerables, like this (shamelessly stolen from Jon Skeet: https://codeblog.jonskeet.uk/2011/01/14/reimplementing-linq-to-objects-part-35-zip/):

private static IEnumerable<TResult> Zip<TFirst, TSecond, TThird, TResult>( 
    IEnumerable<TFirst> first, 
    IEnumerable<TSecond> second,
    IEnumerable<TThird> third, 
    Func<TFirst, TSecond, TThird, TResult> resultSelector) 
{ 
    using (IEnumerator<TFirst> iterator1 = first.GetEnumerator()) 
    using (IEnumerator<TSecond> iterator2 = second.GetEnumerator()) 
    using (IEnumerator<TThird> iterator3 = third.GetEnumerator()) 
    { 
        while (iterator1.MoveNext() && iterator2.MoveNext() && iterator3.MoveNext()) 
        { 
            yield return resultSelector(iterator1.Current, iterator2.Current, iterator3.Current); 
        } 
    } 
}

Then you can do this:

var result = FirstNames
  .Zip(LastNames, BirthDates, (f,l,b) => new {First=f,Last=l,BirthDate=b});

And now you don't even have the issue of the middle object being created, so you get the best of all worlds.

Or use the implementation here to handle any number of sequences generically: Zip multiple/arbitrary number of enumerables in C#
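On .NET 6 and later (newer than this answer), `Enumerable.Zip` ships a built-in three-sequence overload that yields tuples, so the custom extension above is no longer needed there. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var FirstNames = new List<string> { "Bob", "Sondra" };
var LastNames  = new List<string> { "Anderson", "Carlson" };
var Birthdates = new List<DateTime>
{
    new DateTime(1980, 11, 12),
    new DateTime(1978, 9, 16),
};

// The built-in overload yields (First, Second, Third) tuples.
var result = FirstNames
    .Zip(LastNames, Birthdates)
    .Select(t => new { First = t.First, Last = t.Second, BirthDate = t.Third })
    .ToList();

Console.WriteLine(result.Count);     // 2
Console.WriteLine(result[0].First);  // Bob
```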

Robert McKee
  • I originally looked into using `Zip` when this problem arose. I think I agree with you that this is what it is engineered for, but do you think this solution would scale? All of the answers perform very similarly using a simple stopwatch class. This creates a new list of first and last names, then another of first/last names and birthdates. If we added another list of, say, ages, would it scale in that respect? – Fus Ro Dah Jun 06 '17 at 15:40
  • @FusRoDah It would depend heavily on the sources and how the data will be consumed. It does cost some performance to create the temporary objects (I believe). However, if the sources or the consumption benefit from lazy evaluation, it may perform better or worse: in the best case it requires less memory, since data is streamed from the sources one element at a time, processed, and then discarded, rather than assuming the sources are indexable (or converting them to be). – Robert McKee Jun 06 '17 at 17:15
  • Thank you for elaborating on the advantages and disadvantages as well as providing the code for comparison. Checking for `Any()` is exactly what the end product of our generic list is going to do. As such, I think you've more than completely answered my question. – Fus Ro Dah Jun 06 '17 at 18:16

Another option is to use the `Select` overload that supplies the element's index:

var result = Birthdates.Select((bd, i) => new
{ 
    First = FirstNames[i], 
    Last = LastNames[i], 
    Birthdate = bd 
});
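One behavioural difference worth knowing: this indexing approach assumes the lists are all the same length and throws an `ArgumentOutOfRangeException` if `Birthdates` is longer than the name lists, whereas `Zip` silently truncates to the shorter sequence. A small sketch (values illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var firstNames = new List<string> { "Bob", "Sondra" };
var birthdates = new List<DateTime>
{
    new DateTime(1980, 11, 12),
    new DateTime(1978, 9, 16),
    new DateTime(1985, 5, 18),   // one more entry than firstNames
};

// Zip stops at the shorter sequence: 2 results.
int zipped = firstNames.Zip(birthdates, (f, b) => f).Count();
Console.WriteLine(zipped);   // 2

// Indexing past the shorter list throws when enumerated.
bool threw = false;
try
{
    birthdates.Select((bd, i) => firstNames[i]).ToList();
}
catch (ArgumentOutOfRangeException)
{
    threw = true;
}
Console.WriteLine(threw);    // True
```

Depending on whether a length mismatch should be a hard error or silently ignored, either behaviour may be the one you want.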
Anton Sorokin
  • 401
  • 1
  • 7
  • 10
  • I really feel like knowing this will be beneficial to future applications, not even just related to this problem. I still have much to learn about LINQ so thank you – Fus Ro Dah Jun 06 '17 at 15:42

Yep, use a range generator:

var result = Enumerable.Range(0, FirstNames.Count)
    .Select(i => new 
                 { 
                     First = FirstNames[i], 
                     Last = LastNames[i], 
                     Birthdate = Birthdates[i] 
                 });
eocron
  • I like that if we needed to add another list, we could just add one more short line of code with this answer: `Age = Ages[i]`. Friendly for junior devs to support. – Fus Ro Dah Jun 06 '17 at 15:47
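As the comment suggests, extending the range-generator version to a fourth list really is a one-line change (the `Ages` list here is hypothetical, added purely for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var FirstNames = new List<string> { "Bob", "Sondra" };
var LastNames  = new List<string> { "Anderson", "Carlson" };
var Birthdates = new List<DateTime>
{
    new DateTime(1980, 11, 12),
    new DateTime(1978, 9, 16),
};
var Ages = new List<int> { 36, 38 };   // hypothetical fourth list

var result = Enumerable.Range(0, FirstNames.Count)
    .Select(i => new
    {
        First = FirstNames[i],
        Last = LastNames[i],
        Birthdate = Birthdates[i],
        Age = Ages[i],                 // the one extra line
    })
    .ToList();

Console.WriteLine(result.Count);   // 2
Console.WriteLine(result[1].Age);  // 38
```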