Consider the following snippet:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After pairing every element of list_a with every element of list_b,
# in both orders, we get a list that resembles this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and the 2 lists, I want to count how many elements of new_list occur as a col1-col2 row of the dataframe. In the pseudo example above, the result would be 4, since "aaa-fff", "aaa-eee", "ccc-ggg", and "ddd-ccc" each appear as a row of the dataframe (counting each distinct string once).
Right now I am using a linear search, but it is very slow because every pair forces a full scan of the dataframe:
df['col3'] = df['col1'] + "-" + df['col2']

def count_matches_slow(list_a, list_b):
    c1 = 0  # aggregate count over all pairs
    for a in list_a:
        for b in list_b:
            str1 = a + "-" + b
            str2 = b + "-" + a
            # exact match on the whole string (str.contains would also
            # hit substrings); each pair triggers two full column scans
            c2 = (df['col3'] == str1).sum() + (df['col3'] == str2).sum()
            c1 += c2
    return c1
Can someone kindly help me implement a faster algorithm, preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe, build the 2 lists dynamically for each row, and get an aggregate count per row; a rough sketch of what I'm imagining is below.
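Something like this is what I have in mind: hash every row's (col1, col2) pair into a dictionary (a collections.Counter) once, so each pair lookup is O(1) instead of a column scan. This is an untested sketch; pair_counts and count_matches_fast are names I made up:

from collections import Counter
from itertools import product

# one O(n) pass: map each (col1, col2) row pair to its number of occurrences
pair_counts = Counter(zip(df['col1'], df['col2']))

def count_matches_fast(list_a, list_b):
    total = 0
    for a, b in product(list_a, list_b):
        total += pair_counts[(a, b)]  # "a-b" order
        total += pair_counts[(b, a)]  # "b-a" order
    return total

Building pair_counts would be a one-time pass over the 2,000,000 rows, and each of the 7,000 outer rows would then cost only len(list_a) * len(list_b) dictionary lookups. Is this the right direction, or is there something better?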