
I'm trying to build an algorithm in Python to filter a large block of RDF data.

I have one list consisting of about 70 thousand items formatted like <"datum">.

I then have about 6GB worth of items (triples) formatted like <"A"> <"B"> <"C">

I want to extract all the triples that contain any item in the first list, and then extract any triples that contain any individual item from the first extraction (the net effect is to form a partition of the graph that's connected by one step to the seeds from the first list).

I haven't been able to come up with a great algorithm for this (not helped by the fact that I have no formal CS training.)

The best I've come up with so far is to start by splitting the triples in the big list into a list of three-item lists [<"A">, <"B">, <"C">]. I then split that into chunks, and use multiprocessing to create processes that take the full small list and a chunk of the big list and...

for line in big_list:
    for item in small_list:
        if item in line:
            bucket.append(line)

This algorithm takes quite a while.
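
For reference, here is roughly what that chunked, multiprocessing scan looks like as runnable code; the chunk size, worker count, and function names are placeholders rather than anything from the question:

from multiprocessing import Pool

def scan_chunk(args):
    # args is a (chunk, small_list) pair; chunk is a slice of big_list
    chunk, small_list = args
    bucket = []
    for line in chunk:                 # line is a triple like ["A", "B", "C"]
        for item in small_list:
            if item in line:
                bucket.append(line)    # note: a triple is appended once per matching item
    return bucket

def filter_triples(big_list, small_list, chunk_size=100000, workers=4):
    # split big_list into chunks and scan each one in a separate process
    chunks = [big_list[i:i + chunk_size] for i in range(0, len(big_list), chunk_size)]
    pool = Pool(workers)
    try:
        results = pool.map(scan_chunk, [(chunk, small_list) for chunk in chunks])
    finally:
        pool.close()
        pool.join()
    # flatten the per-chunk buckets back into one list of triples
    return [line for bucket in results for line in bucket]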

Is there any faster way to do this? If there's a specific algorithm, you can just give me the name and I'll figure out how to implement it.

Thanks!

Clarifications per comments:

  1. All the data items are strings. So the small list might contain ["Mickey", "Mouse", "Minny", "Cat"] and the big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]

  2. Only one item in each big-list triple needs to match an item from the small list for that triple to count.

  3. All of the items in the small list are already unique, so I didn't think converting them to a set would help. But I will try that.

  4. I can create whatever intermediate structures I want. I'm experimenting with an inverted index constructed using shelve right now (a rough version is sketched below).
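
A rough sketch of that inverted-index experiment, assuming big_list has already been split into lists of three strings; the shelve filename and the variable names here are made up for illustration:

import shelve

# Build the index once: map each item (a string) to the list of triples containing it.
index = shelve.open("triple_index.db")
for line in big_list:                  # line is a triple like ["A", "B", "C"]
    for item in line:
        postings = index.get(item, [])
        postings.append(line)
        index[item] = postings         # reassign so shelve writes the update to disk
index.sync()

# First pass: triples containing any seed item from the small list.
first_pass = []
for item in small_list:
    if item in index:
        first_pass.extend(index[item])
# (a triple can show up more than once here if it matches on several seed items)

# Second pass: triples containing any item that appears in a first-pass triple.
neighbors = set()
for line in first_pass:
    neighbors.update(line)
second_pass = []
for item in neighbors:
    if item in index:
        second_pass.extend(index[item])

index.close()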


1 Answer


You should probably first store the small list in a set, so lookup is faster. This prevents going through 70,000 iterations for every item in big_list.

small_list_set = set(small_list)
for line in big_list:
    for item in line:
        if item in small_list_set:
            bucket.append(line)
happydave
  • Very good suggestion. This will likely be much, much faster, because lookups in a well-implemented `set` (using hashed keys) are `O(1)` time, as opposed to a search through a list in `O(n)` time. – Li-aung Yip May 03 '12 at 02:51
  • Note that this (like the OP's code) will append `line` multiple times if there are multiple matches, which maybe isn't desired (I'm not clear on exactly what filtering is wanted). This could be easily avoided by adding a `break` after `bucket.append(line)`. – Danica May 03 '12 at 02:57
  • I agree - I really wasn't clear on exactly what the OP wanted either. The main suggestion was to use a set to reduce run-time by a factor of 70,000 or so. – happydave May 03 '12 at 02:58
  • Even though the items in the small list are already unique, using set() does go MUCH faster. I think I'm ultimately going to go with building an inverted index, because it makes it easy to do the second step of finding triples that contain an item from any of the triples matching the first pass (both passes are sketched below). Building the index takes a pretty long time, but once it's built the lookup is very fast. – rogueleaderr May 03 '12 at 03:41
  • @rogueleaderr: Here we're using a `set` not because elements in it are guaranteed to be unique, which is one property of a `set`, but because lookups in it are much faster. (This is **possible** because each element can only occur once.) – Li-aung Yip May 03 '12 at 03:44
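
Pulling these comments together, a sketch (not from the original thread) of both passes done with plain sets and a break after the first match; it scans big_list twice, once per pass:

seeds = set(small_list)

# Pass 1: triples that contain at least one seed item.
first_pass = []
for line in big_list:
    for item in line:
        if item in seeds:
            first_pass.append(line)
            break                      # stop after the first match to avoid duplicate appends

# Every item appearing in a first-pass triple becomes a new seed.
expanded = set()
for line in first_pass:
    expanded.update(line)

# Pass 2: triples that contain any of those expanded items.
second_pass = []
for line in big_list:
    for item in line:
        if item in expanded:
            second_pass.append(line)
            break

Each pass is a single linear scan with constant-time membership tests, so the total work is proportional to the size of the triple data rather than to len(small_list) times len(big_list).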