List1: {"123456", "432978", "321675", …}  // containing 100,000 members

List2: {"7674543897", "1234568897", "8899776644",…} // containing 500,000 members

I want to extract all items in List2 whose first 6 digits match a member of List1. Here the string "1234568897" is valid because its first 6 digits equal List1's first item. What is the fastest way of doing this?

    foreach (string id in List1)
    {
        string result = List2.FirstOrDefault(x => x.Contains(id));
        if (result != null)
        {
            // some work here
        }
    }

This works for groups of fewer than 1,000 items, but as List2 grows it takes too long.

Hossein Amini
  • What have you tried already? What timing mechanisms and tests have you set up on your attempts so far? – Immortal Blue Mar 06 '13 at 11:50
  • With a single foreach loop this takes 5 minutes to give a result. I have tried `List2.FirstOrDefault(x => x.Contains(id))`, with the id coming from a foreach loop iterating through all items in List1. – Hossein Amini Mar 06 '13 at 11:56

3 Answers


You can use Enumerable.Join which is quite efficient:

var match = from str1 in List1
            join str2 in List2
            on str1 equals (str2.Length < 6 ? str2 : str2.Substring(0, 6))
            select str2;

Demo

Edit

Since @OleksandrPshenychnyy assumed that it would be very slow with such large collections, here is a demo with 100000 random strings in list1 and 500000 strings in list2, in the same range as in the question. It executes in 600 milliseconds on ideone:

http://ideone.com/rB6LU4
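For comparison, the same lookup can be hand-rolled with a `HashSet<string>` of the List1 entries, which is roughly the hash-based lookup `Enumerable.Join` performs internally. A minimal sketch with made-up sample data (it assumes every List1 entry is exactly 6 characters):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PrefixJoinSketch
{
    static void Main()
    {
        var list1 = new List<string> { "123456", "432978", "321675" };
        var list2 = new List<string> { "7674543897", "1234568897", "8899776644" };

        // Build a hash set of the 6-digit prefixes for O(1) lookups.
        var prefixes = new HashSet<string>(list1);

        // Keep every List2 item whose first 6 characters are in the set.
        var matches = list2
            .Where(s => s.Length >= 6 && prefixes.Contains(s.Substring(0, 6)))
            .ToList();

        foreach (var m in matches)
            Console.WriteLine(m); // prints 1234568897
    }
}
```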

Tim Schmelter
  • That would be a very slow approach for such big collections – Sasha Mar 06 '13 at 12:04
  • @OleksandrPshenychnyy: I have tested it with 100000 random strings in list1 and 5000000 strings in list2 with the same range, and it executed in 700 milliseconds. I will provide a demo soon. – Tim Schmelter Mar 06 '13 at 12:13
  • And I can expect Aho-Corasick to execute in around 10 ms (though I never implemented it, as it is too complicated =) ). I answered only because the question was about the fastest algorithm, not the simplest to implement. – Sasha Mar 06 '13 at 12:42
  • @OleksandrPshenychnyy: I don't know which computer you are using, but your code errors out on ideone (I assume timeout or memory). On my server it takes 6 times longer than mine (3900 ms instead of 690). Read my link above to see why `Enumerable.Join` is fast: it uses `HashSet`s internally (very fast lookup performance). – Tim Schmelter Mar 06 '13 at 12:47
0

It depends very much on the hardware you're running on, and you may well be engaging in premature optimisation: simply brute-forcing it may be fast enough. 500,000 * 100,000 is your maximum number of comparisons (i.e. if nothing matches), and that really won't take very long on a reasonably specced desktop PC.

If you find that's too slow for your purposes then there are a few techniques you could use to improve the performance:

  1. Parallelise it. This will only show big benefits on multi-core hardware. Essentially, you'd want a dispatcher class that feeds numbers from List2 to threads that execute the search for a match in List1. See the Task Parallel Library.

  2. Reduce the number of comparisons by being smarter. Do some pre-analysis on your collections to improve their characteristics for the comparison step. For example, put items from List1 into a list of 'buckets' where each bucket contains all the sequences with the same first 2 digits: bucket 1 would contain those starting "00", bucket 2 "01", and so on. When you do the actual comparison, you only have to compare a small number of strings. From your example, for "1234568897" we would check the bucket for "12", and then we know we only have to do the full string comparison against the strings in that bucket.

This kind of pre-processing may actually make things slower, though, so make sure you profile your code to be sure.
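The bucketing idea in step 2 can be sketched as follows; the names and sample data are illustrative, not from the original post:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class BucketSketch
{
    static void Main()
    {
        var list1 = new List<string> { "123456", "432978", "321675" };
        var list2 = new List<string> { "7674543897", "1234568897", "8899776644" };

        // Pre-analysis: group List1 entries by their first two digits.
        var buckets = list1
            .GroupBy(s => s.Substring(0, 2))
            .ToDictionary(g => g.Key, g => g.ToList());

        var matches = new List<string>();
        foreach (var candidate in list2)
        {
            if (candidate.Length < 6) continue;

            // Only the small bucket for "12", "43", ... needs full comparisons.
            if (buckets.TryGetValue(candidate.Substring(0, 2), out var bucket)
                && bucket.Contains(candidate.Substring(0, 6)))
            {
                matches.Add(candidate);
            }
        }

        Console.WriteLine(string.Join(", ", matches)); // 1234568897
    }
}
```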

Steve
0

The most efficient way to implement what you need is the Aho-Corasick algorithm, if you need to process new elements of List2 dynamically. It is based on prefix trees, a technique commonly used in string-search algorithms, and gives you a complexity of O(sum of all string lengths).

The other option is the Knuth-Morris-Pratt algorithm, which gives you the same complexity but works with a single search string at a time.
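Since all List1 entries here are fixed-length digit prefixes, the prefix-tree idea can be sketched without a full Aho-Corasick implementation; the class and variable names below are illustrative only:

```csharp
using System;
using System.Collections.Generic;

class TrieSketch
{
    class Node
    {
        public Node[] Next = new Node[10]; // one child per digit 0-9
        public bool Terminal;              // a List1 prefix ends here
    }

    static void Main()
    {
        var list1 = new List<string> { "123456", "432978", "321675" };
        var list2 = new List<string> { "7674543897", "1234568897", "8899776644" };

        // Build the trie from the 6-digit prefixes.
        var root = new Node();
        foreach (var prefix in list1)
        {
            var node = root;
            foreach (var c in prefix)
            {
                int d = c - '0';
                if (node.Next[d] == null) node.Next[d] = new Node();
                node = node.Next[d];
            }
            node.Terminal = true;
        }

        // Walk each List2 string through the trie; a terminal node means
        // the string starts with some List1 prefix.
        var matches = new List<string>();
        foreach (var s in list2)
        {
            var node = root;
            foreach (var c in s)
            {
                node = node.Next[c - '0'];
                if (node == null) break;
                if (node.Terminal) { matches.Add(s); break; }
            }
        }

        Console.WriteLine(string.Join(", ", matches)); // 1234568897
    }
}
```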

But if you are OK with O((list2.Count*log(list2.Count) + list1.Count*log(list1.Count))*AverageStrLength), I can propose my simple implementation: sort both lists, then walk them simultaneously and collect the matches.

UPDATE: This implementation is fine when the lists are already sorted; then it executes in about 100 ms. But the sorting itself takes 3.5 sec, so the implementation isn't as good as I expected at first. As simple implementations go, Tim's solution is faster.

list1.Sort();
list2.Sort();
var result = new List<string>();
for (int i = 0, j = 0; i < list1.Count; i++)
{
    var l1Val = list1[i];
    // Skip List2 entries that sort before the current List1 prefix.
    for (; j < list2.Count && string.CompareOrdinal(l1Val, list2[j]) > 0; j++) ;
    // Collect every List2 entry that starts with that prefix.
    for (; j < list2.Count && list2[j].StartsWith(l1Val); j++)
    {
        result.Add(list2[j]);
    }
}

That's the simplest efficient implementation I can propose.

Sasha
  • This "most simple efficient implementation" is 6 times slower than my ["very slow" approach](http://stackoverflow.com/a/15246619/284240) ;) Result: "Elapsed in 3888,1323 ms. 52896 matches found" – Tim Schmelter Mar 06 '13 at 12:36
  • Ouch, sorry, didn't calculate the Sorts efficiency at first. It's fast when the lists are already sorted =). You are right, in total it's almost 4 sec =(. – Sasha Mar 06 '13 at 12:50
  • this StartsWith or EndWith or Contains all are the same and they are very slow in huge numbers. Tim Schmelter's way ran on my laptop in several times with an average of 750ms and that it fantastic! – Hossein Amini Mar 06 '13 at 13:46
  • All string operations are comparatively slow. As for StartsWith vs Contains, the first one is much faster for long strings. As I written in the UPDATE section, the loop executes fast, but I underestimated the sorting time. It would be faster to use hashing instead, but Tim's solution does it in very simple and efficient way. I voted for him =). My solution may have usage only if your lists change often and you can support their sorting order and traverse then using more complicated dataStructures then simple list. But it looks like this is not the case now. – Sasha Mar 06 '13 at 14:00