8

I know that this topic has been addressed a thousand times. But I can't figure out a solution.

I'm trying to count how often a list (each row of df1.list1) occurs in a column of list (df2.list2). All lists consist of unique values only. List1 includes about 300.000 rows and list2 30.000 rows.

I've got a working code but its terribly slow (because I'm using iterrows). I also tried itertuples() but it gave me an error ("too many values to unpack (expected 2)"). I found a similar question online: Pandas counting occurrence of list contained in column of lists. In the mentioned case the person considers only the occurrence of one list within a column of lists. However, I can't work things out so each row in df1.list1 is compared to df2.list2.

Thats how my lists look like (simplified):

df1.list1

0   ["a", "b"]
1   ["a", "c"]
2   ["a", "d"]
3   ["b", "c"]
4   ["b", "d"]
5   ["c", "d"]


df2.list2

0    ["a", "b" ,"c", "d"]
1    ["a", "b"] 
2    ["b", "c"]
3    ["c", "d"]
4    ["b", "c"]

What I would like to come up with:

df1

    list1         occurence   
0   ["a", "b"]    2
1   ["a", "c"]    1
2   ["a", "d"]    1
3   ["b", "c"]    3
4   ["b", "d"]    1
5   ["c", "d"]    2

Thats what I've got so far:

for index, row in df_combinations.iterrows():
    df1.at[index, "occurrence"] = df2["list2"].apply(lambda x: all(i in x for i in row['list1'])).sum()

Any suggestions how I can speed things up? Thanks in advance!

Hannah
  • 329
  • 2
  • 3
  • 8
  • Where are your lists coming from? Can you sort them during construction? – Jack Aidley Jul 03 '18 at 08:24
  • I constructed list1, list2 was already given. I could sort them if its easier to handle. – Hannah Jul 03 '18 at 08:31
  • With sorted lists you should be able to do this much, much faster. The question is whether sorting the lists is going to cost you more time than it saves in the faster comparison. – Jack Aidley Jul 03 '18 at 08:58

1 Answers1

6

This should be much faster:

df = pd.DataFrame({'list1': [["a","b"],
                             ["a","c"],
                             ["a","d"],
                             ["b","c"],
                             ["b","d"],
                             ["c","d"]]*100})
df2 = pd.DataFrame({'list2': [["a","b","c","d"],
                              ["a","b"], 
                              ["b","c"],
                              ["c","d"],
                              ["b","c"]]*100})

list2 = df2['list2'].map(set).tolist()

df['occurance'] = df['list1'].apply(set).apply(lambda x: len([i for i in list2 if x.issubset(i)]))

Using your approach:

%timeit for index, row in df.iterrows(): df.at[index, "occurrence"] = df2["list2"].apply(lambda x: all(i in x for i in row['list1'])).sum()

1 loop, best of 3: 3.98 s per loop Using mine:

%timeit list2 = df2['list2'].map(set).tolist();df['occurance'] = df['list1'].apply(set).apply(lambda x: len([i for i in list2 if x.issubset(i)]))

10 loops, best of 3: 29.7 ms per loop

Notice that I've increased the size of list by a factor of 100.

EDIT

This one seems even faster:

list2 = df2['list2'].sort_values().tolist()
df['occurance'] = df['list1'].apply(lambda x: len(list(next(iter(())) if not all(i in list2 for i in x) else i for i in x)))

And timing:

%timeit list2 =  df2['list2'].sort_values().tolist();df['occurance'] = df['list1'].apply(lambda x: len(list(next(iter(())) if not all(i in list2 for i in x) else i for i in x)))

100 loops, best of 3: 14.8 ms per loop

zipa
  • 27,316
  • 6
  • 40
  • 58