Search a list of list of strings for a list of strings in python efficiently

Question

I have a list of list of strings and a list of strings. for example:

L1=[["cat","dog","apple"],["orange","green","red"]]
L2=["cat","red"]

if L1[i] contains any item from L2 I need to put the pairs (for creating edges in a graph) like, in my example, I need the pairs ("cat","dog"),("cat,apple"),("red,orange"),("red","green")

What approach should I use to make it most efficient. (My list L1 is huge)

Did you try it straight-forward (and maybe less efficient)? – alex Feb 16 '12 at 18:35 — alex, Feb 16 '12 at 18:35

score 2 · Accepted Answer · answered Feb 16 '12 at 19:16

Supposing that you might have, more than one "control" item in your L1 sublists.

I'd do it using set() and itertools.product():

from itertools import product

def generate_edges(iterable, control):
    edges = []
    control_set = set(control)
    for e in iterable:
        e_set = set(e)
        common = e_set & control_set
        to_pair = e_set - common
        edges.extend(product(to_pair, common))
    return edges

Example:

>>> L1 = [["cat","dog","apple"],
...       ["orange","green","red"],
...       ["hand","cat","red"]]
>>> L2 = ["cat","red"]
>>> generate_edges(L1, L2)
[('apple', 'cat'),
 ('dog', 'cat'),
 ('orange', 'red'),
 ('green', 'red'),
 ('hand', 'red'),
 ('hand', 'cat')]

score 1 · Answer 2 · answered Feb 16 '12 at 18:36

I'd suggest transforming them all to sets and using set operations (intersection) to figure out what terms from L2 are in each L1 item. You can then use set subtraction to get the list of items you need to pair.

edges = []
L2set = set(L2)
for L1item in L1:
    L1set = set(L1item)
    items_in_L1item = L1set & L2set
    for item in items_in_L1item:
        items_to_pair = L1set - set([item])
        edges.extend((item, i) for i in items_to_pair)

juliomalegria · Answer 3 · 2012-02-16T18:42:19.483

1

To make this code optimal even if L1 and L2 are huge, use izip that produces a generator instead of creating a huge list of tuples. If you're working in Python3, just use zip.

from itertools import izip

pairs = []
for my_list, elem in izip(L1, L2):
    if elem in my_list:
        pairs += [(elem, e) for e in my_list if e!=elem]
print pairs

The code is very comprehesible, it's almost pure english! First, you're looping over each list and its corresponding element, then you're asking if the element is inside the list, if it is, print all pairs except the pair (x, x).

Output:

[('cat', 'dog'), ('cat', 'apple'), ('red', 'orange'), ('red', 'green')]

edited Feb 16 '12 at 18:42

answered Feb 16 '12 at 18:37

juliomalegria

24,229
14
73
89

Your code does not work for the general case the OP is looking for. – Amber Feb 16 '12 at 18:41
Because any of the elements from L2 may be in any of the lists from L1, and L1 may not have the same number of lists in it as L2 has elements. – Amber Feb 17 '12 at 04:51

score 1 · Answer 4 · answered Feb 16 '12 at 20:08

If L1 is very large you might want to look into using bisect. It requires that yo flatten and sort L1 first. You could do something like:

from bisect import bisect_left, bisect_right
from itertools import chain

L1=[["cat","dog","apple"],["orange","green","red","apple"]]
L2=["apple", "cat","red"]

M1 = [[i]*len(j) for i, j in enumerate(L1)]
M1 = list(chain(*M1))
L1flat = list(chain(*L1))
I = sorted(range(len(L1flat)), key=L1flat.__getitem__)
L1flat = [L1flat[i] for i in I]
M1 = [M1[i] for i in I]

for item in L2:
    s = bisect_left(L1flat, item)
    e = bisect_right(L1flat, item)
    print item, M1[s:e]

#apple [0, 1]
#cat [0]
#red [1]

Search a list of list of strings for a list of strings in python efficiently

4 Answers4

Linked