
I'm trying to build an algorithm in Python to filter a large block of RDF data.

I have one list consisting of about 70 thousand items formatted like <"datum">.

I then have about 6GB worth of items (triples) formatted like <"A"> <"B"> <"C">

I want to extract all the triples that contain any item in the first list, and then extract any triples that contain any individual item from the first extraction (the net effect is to form a partition of the graph that's connected by one step to the seeds from the first list).

I haven't been able to come up with a great algorithm for this (not helped by the fact that I have no formal CS training.)

The best I've come up with so far is to start by splitting the triples in the big list into a list of three-item lists [<"A">, <"B">, <"C">]. I then split that into chunks, and use multiprocessing to create processes that take the full small list and a chunk of the big list and...

for line in big_list:
    for item in small_list:
        if item in line:
            bucket.append(line)

This algorithm takes quite a while.
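
For reference, here is roughly what that chunked, multiprocessing scan looks like as runnable code; the chunk size, worker count, and function names are placeholders rather than anything from the question:

from multiprocessing import Pool

def scan_chunk(args):
    # args is a (chunk, small_list) pair; chunk is a slice of big_list
    chunk, small_list = args
    bucket = []
    for line in chunk:                 # line is a triple like ["A", "B", "C"]
        for item in small_list:
            if item in line:
                bucket.append(line)    # note: a triple is appended once per matching item
    return bucket

def filter_triples(big_list, small_list, chunk_size=100000, workers=4):
    # split big_list into chunks and scan each one in a separate process
    chunks = [big_list[i:i + chunk_size] for i in range(0, len(big_list), chunk_size)]
    pool = Pool(workers)
    try:
        results = pool.map(scan_chunk, [(chunk, small_list) for chunk in chunks])
    finally:
        pool.close()
        pool.join()
    # flatten the per-chunk buckets back into one list of triples
    return [line for bucket in results for line in bucket]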

Is there any faster way to do this? If there's a specific algorithm, you can just give me the name and I'll figure out how to implement it.

Thanks!

Clarifications per comments:

  1. All the data items are strings. So the small list might contain ["Mickey", "Mouse", "Minny", "Cat"] and the big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]

  2. Only one item in each big-list triple needs to match an item from the small list for that triple to count.

  3. All of the items in the small list are already unique, so I didn't think converting them to a set would help. But I will try that.

  4. I can create whatever intermediate structures I want. I'm experimenting with an inverted index constructed using shelve right now (a rough version is sketched below).
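
A rough sketch of that inverted-index experiment, assuming big_list has already been split into lists of three strings; the shelve filename and the variable names here are made up for illustration:

import shelve

# Build the index once: map each item (a string) to the list of triples containing it.
index = shelve.open("triple_index.db")
for line in big_list:                  # line is a triple like ["A", "B", "C"]
    for item in line:
        postings = index.get(item, [])
        postings.append(line)
        index[item] = postings         # reassign so shelve writes the update to disk
index.sync()

# First pass: triples containing any seed item from the small list.
first_pass = []
for item in small_list:
    if item in index:
        first_pass.extend(index[item])
# (a triple can show up more than once here if it matches on several seed items)

# Second pass: triples containing any item that appears in a first-pass triple.
neighbors = set()
for line in first_pass:
    neighbors.update(line)
second_pass = []
for item in neighbors:
    if item in index:
        second_pass.extend(index[item])

index.close()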


1 Answer


You should probably first store the small list in a set, so lookup is faster. This prevents going through 70,000 iterations for every item in big_list.

small_list_set = set(small_list)
for line in big_list:
    for item in line:
        if item in small_list_set:
            bucket.append(line)
happydave
  • Very good suggestion. This will likely be much, much faster, because lookups in a well-implemented `set` (using hashed keys) are `O(1)` time, as opposed to a search through a list in `O(n)` time. – Li-aung Yip May 03 '12 at 02:51
  • Note that this (like the OP's code) will append `line` multiple times if there are multiple matches, which maybe isn't desired (I'm not clear on exactly what filtering is wanted). This could be easily avoided by adding a `break` after `bucket.append(line)`. – Danica May 03 '12 at 02:57
  • I agree - I really wasn't clear on exactly what the OP wanted either. The main suggestion was to use a set to reduce run-time by a factor of 70,000 or so. – happydave May 03 '12 at 02:58
  • Even though the items in the small list are already unique, using set() does go MUCH faster. I think I'm ultimately going to go with building an inverted index, because it makes it easy to do the second step of finding triples that contain an item from any of the triples matching the first pass (both passes are sketched below). Building the index takes a pretty long time, but once it's built the lookup is very fast. – rogueleaderr May 03 '12 at 03:41
  • @rogueleaderr: Here we're using a `set` not because elements in it are guaranteed to be unique, which is one property of a `set`, but because lookups in it are much faster. (This is **possible** because each element can only occur once.) – Li-aung Yip May 03 '12 at 03:44
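
Pulling these comments together, a sketch (not from the original thread) of both passes done with plain sets and a break after the first match; it scans big_list twice, once per pass:

seeds = set(small_list)

# Pass 1: triples that contain at least one seed item.
first_pass = []
for line in big_list:
    for item in line:
        if item in seeds:
            first_pass.append(line)
            break                      # stop after the first match to avoid duplicate appends

# Every item appearing in a first-pass triple becomes a new seed.
expanded = set()
for line in first_pass:
    expanded.update(line)

# Pass 2: triples that contain any of those expanded items.
second_pass = []
for line in big_list:
    for item in line:
        if item in expanded:
            second_pass.append(line)
            break

Each pass is a single linear scan with constant-time membership tests, so the total work is proportional to the size of the triple data rather than to len(small_list) times len(big_list).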