
I need to filter a List<object> so that I remove any items where a string property does not exist inside another List<string>.

I created this console app just to make sure I had the LINQ syntax correct:

class FooBar
{
    public int Id { get; set; }
    public string ValueName { get; set; }
}

and then...

List<FooBar> foobars = new List<FooBar>
{
    new FooBar { Id = 1, ValueName = "Val1" },
    new FooBar { Id = 2, ValueName = "Val2" },
    new FooBar { Id = 3, ValueName = "Val3" },
    new FooBar { Id = 4, ValueName = "Val4" }
};

List<string> myStrings = new List<string>
{
    "Val1",
    "Val3"
};

// Only keep records where ValueName is found in `myStrings`
foobars = foobars.Where(f => myStrings.Contains(f.ValueName)).ToList();

So, this line:

foobars = foobars.Where(f => myStrings.Contains(f.ValueName)).ToList();

does exactly what I want, and it gives me back these two records:

{ Id = 1, ValueName = "Val1" }
{ Id = 3, ValueName = "Val3" }

All's well. BUT... in the actual application, foobars has over 200k items, and myStrings has about 190k. And when that LINQ line gets executed, it takes upwards of 5 minutes to complete.

I'm clearly doing something wrong. 200k records isn't THAT big. And the real FooBar isn't all that complex (no nested objects, and only 9 properties).

What's going on here?

  • Are both your lists sorted? Asking because of [this](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array) – Justin Lessard Nov 12 '19 at 18:40
  • If it's in-memory, that seems like a long time. If what you are actually doing is going to a database and using `Contains`, you're pulling back 190k records while avoiding using the database's indexes. Are you using a database? – ps2goat Nov 12 '19 at 18:40
  • Use `HashSet myStrings` instead of `List` – mtkachenko Nov 12 '19 at 18:41
  • 2*4 is **almost** comparable to your actual code, which does 200K*190K string comparisons (38,000,000,000). Use a better data structure, like a HashSet. – CodeCaster Nov 12 '19 at 18:41
  • @JustinLessard, nope, not sorted. I will sort them both and try again. – Casey Crookston Nov 12 '19 at 18:41
  • A HashSet would be best, or using a join would work as well. – Trevor Nov 12 '19 at 18:42
  • @ps2goat, yes, all in memory. – Casey Crookston Nov 12 '19 at 18:42
  • `foobars = (from foo in foobars join myString in myStrings on foo.ValueName equals myString select foo).ToList();` give that a shot as an alternative instead of doing a contains for every item. – Trevor Nov 12 '19 at 19:05
  • @Çöđěxěŕ, thanks! I've already tested the answer below by Julian, and it worked wonderfully. – Casey Crookston Nov 12 '19 at 19:11
  • @CaseyCrookston I understand, you're welcome, I posted it as a comment as an alternative to *try* and compare. On another note, calling `ToList()` is a very time consuming routine as well. – Trevor Nov 12 '19 at 19:12

1 Answer


The problem here is that you're doing foobars.Where(f => myStrings.Contains(f.ValueName)), so for every item in foobars you are checking all items of myStrings.

That scales quadratically, also called O(n^2). So if you have 10+10 items, you do 100 checks (10*10), and if you have 10,000+10,000 items, you will do 100,000,000 checks. In your case you're doing 38,000,000,000+ checks ;)
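
To see where the time goes, that one-liner is roughly equivalent to this nested loop (a sketch, not the exact code LINQ generates):

// What foobars.Where(f => myStrings.Contains(f.ValueName)).ToList() boils down to
var result = new List<FooBar>();
foreach (var f in foobars)              // ~200,000 iterations
{
    // List<string>.Contains scans the list from the front until it finds a match
    foreach (var s in myStrings)        // up to ~190,000 comparisons per outer item
    {
        if (s == f.ValueName)
        {
            result.Add(f);
            break;
        }
    }
}
foobars = result;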

Solution: create a HashSet from myStrings and use the HashSet's Contains, which is on average an O(1) lookup.

e.g. replace with:

var myStringsSet = new HashSet<string>(myStrings);
foobars = foobars.Where(f => myStringsSet.Contains(f.ValueName)).ToList();

Now with 10,000+10,000 items, you will do 10,000 checks instead of 100,000,000. In your case that will be 200,000 checks instead of 38,000,000,000.
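
If you want to verify the difference yourself, a minimal Stopwatch comparison along these lines should show it. The data below is made up (and scaled down from 200k/190k so the slow version finishes in reasonable time); it just reuses the FooBar class from the question:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    class FooBar
    {
        public int Id { get; set; }
        public string ValueName { get; set; }
    }

    static void Main()
    {
        // Made-up test data, scaled down so the List<string>.Contains version finishes quickly
        var foobars = Enumerable.Range(0, 20_000)
            .Select(i => new FooBar { Id = i, ValueName = "Val" + i })
            .ToList();
        var myStrings = Enumerable.Range(0, 19_000)
            .Select(i => "Val" + (i * 2))   // roughly half of the foobars will match
            .ToList();

        var sw = Stopwatch.StartNew();
        var slow = foobars.Where(f => myStrings.Contains(f.ValueName)).ToList();
        Console.WriteLine($"List<string>.Contains:    {sw.Elapsed} ({slow.Count} matches)");

        sw.Restart();
        var myStringsSet = new HashSet<string>(myStrings);
        var fast = foobars.Where(f => myStringsSet.Contains(f.ValueName)).ToList();
        Console.WriteLine($"HashSet<string>.Contains: {sw.Elapsed} ({fast.Count} matches)");
    }
}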

Julian