I have a pandas dataframe of shape (142000, 1) with a column named keywords where each cell contains a list of keywords.
I want to find which pairs of rows share at least one keyword. Currently I compare every pair of rows:
from itertools import combinations

for i, j in combinations(range(len(df.index)), 2):
    # a non-empty intersection means rows i and j share a keyword
    if set(df['keywords'][i]) & set(df['keywords'][j]):
        do_something()  # this runs reasonably fast, no problem here
The set intersection works as follows: set([1,2,3]) & set([3,4,5]) gives {3}, and a non-empty set is truthy. So it's really just a check for whether the two lists share any item.
The problem is the brute-force loop over all pairs: there are 142000!/[(142000 - 2)! 2!] = 142000 * 141999 / 2 = 10,081,929,000 iterations in total, i.e. about 10 billion.
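To put a number on that, here is a quick check of the pair count. It assumes Python 3.8+ for math.comb, and the variable names are only for illustration:

```python
import math

n_rows = 142_000

# Number of unordered pairs of rows: n! / ((n - 2)! * 2!) = n * (n - 1) / 2
n_pairs = math.comb(n_rows, 2)
print(n_pairs)  # 10081929000
```

Even if a single comparison is cheap, roughly 10 billion of them is too many for a plain Python loop.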
Is there a better way to do this?
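One direction I can imagine is an inverted index: map each keyword to the rows that contain it, then only generate pairs within each keyword's group instead of over all rows. Below is a minimal sketch of that idea, not a tested solution for the full data. The function name pairs_sharing_keyword and the toy lists are made up for illustration, and it assumes the keywords are hashable (e.g. strings).

```python
from collections import defaultdict
from itertools import combinations


def pairs_sharing_keyword(keyword_lists):
    """Return the set of (i, j) position pairs, i < j, whose lists share a keyword."""
    # Inverted index: keyword -> positions of the rows that contain it.
    rows_by_keyword = defaultdict(list)
    for pos, keywords in enumerate(keyword_lists):
        for kw in set(keywords):  # set() so a repeated keyword counts once per row
            rows_by_keyword[kw].append(pos)

    # Rows are appended in increasing order, so each pair comes out as (i, j) with i < j.
    pairs = set()
    for rows in rows_by_keyword.values():
        pairs.update(combinations(rows, 2))
    return pairs


# Toy data standing in for the real column; with the dataframe this would be
# pairs_sharing_keyword(df['keywords']).
keyword_lists = [["a", "b"], ["b", "c"], ["d"], ["c", "a"]]
shared = pairs_sharing_keyword(keyword_lists)
print(sorted(shared))  # [(0, 1), (0, 3), (1, 3)]
```

This only helps if each keyword appears in relatively few rows. A keyword that occurs in most rows would still produce a quadratic number of pairs for that keyword alone, and the resulting set of pairs has to fit in memory.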