Consider the following snippet:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After pairing every element of list_a with every element of list_b,
# in both orders, we get a list that resembles this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and the 2 lists, I want to count how many elements of new_list occur as a col1-col2 row of the dataframe. In the pseudo example above, the result would be 4, since "aaa-fff", "aaa-eee", "ccc-ggg", and "ddd-ccc" each appear as a row of the dataframe (counting each distinct string once).
Right now I am using a linear search, but it is very slow because every pair forces a full scan of the dataframe:
df['col3'] = df['col1'] + "-" + df['col2']

def count_matches_slow(list_a, list_b):
    c1 = 0  # aggregate count over all pairs
    for a in list_a:
        for b in list_b:
            str1 = a + "-" + b
            str2 = b + "-" + a
            # exact match on the whole string (str.contains would also
            # hit substrings); each pair triggers two full column scans
            c2 = (df['col3'] == str1).sum() + (df['col3'] == str2).sum()
            c1 += c2
    return c1
Can someone kindly help me implement a faster algorithm, preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe, build the 2 lists dynamically for each row, and get an aggregate count per row; a rough sketch of what I'm imagining is below.
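Something like this is what I have in mind: hash every row's (col1, col2) pair into a dictionary (a collections.Counter) once, so each pair lookup is O(1) instead of a column scan. This is an untested sketch; pair_counts and count_matches_fast are names I made up:

from collections import Counter
from itertools import product

# one O(n) pass: map each (col1, col2) row pair to its number of occurrences
pair_counts = Counter(zip(df['col1'], df['col2']))

def count_matches_fast(list_a, list_b):
    total = 0
    for a, b in product(list_a, list_b):
        total += pair_counts[(a, b)]  # "a-b" order
        total += pair_counts[(b, a)]  # "b-a" order
    return total

Building pair_counts would be a one-time pass over the 2,000,000 rows, and each of the 7,000 outer rows would then cost only len(list_a) * len(list_b) dictionary lookups. Is this the right direction, or is there something better?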