
I have a pandas dataframe of shape (142000, 1) with a column named keywords where each cell contains a list of keywords.

I want to check which pairs of rows share at least one keyword.

from itertools import combinations

for i in combinations(range(len(df.index)), 2):
    if set(df['keywords'][i[0]]) & set(df['keywords'][i[1]]):
        do_something()  # this runs reasonably fast, no problem here

The set intersection works as follows: set([1,2,3]) & set([3,4,5]) gives {3}. So it's really just a check for whether the two lists share any item.

The problem is brute-forcing it, because there are 142000! / ((142000 - 2)! · 2!) ≈ 10^10 iterations in total.
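
For scale, that binomial coefficient works out to roughly ten billion pairs (quick check with Python 3.8+):

import math

# number of unordered row pairs, C(142000, 2)
math.comb(142000, 2)  # 10081929000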

Is there a better way to do this?

  • How long are the lists of keywords? And does the result need to be a list of pairs, or how is it going to be used? – Ry- Jul 26 '20 at 11:07
  • So you basically have a list of lists and want to know which of those sublists have at least one keyword in common? If so it's definitely doable in O(n). – Marco Bonelli Jul 26 '20 at 11:08
  • @Ry- The lists have all the same length of 100. From the results I'll be using i[0] and i[1]. – José Carlo Jul 26 '20 at 11:20
  • `set.issubset` would work, see https://stackoverflow.com/questions/16579085/how-can-i-verify-if-one-list-is-a-subset-of-another – LoschmidtsSchnitzel Jul 26 '20 at 11:21
  • Yeah, but for example if all of them intersected, you’d still have 142000×141999 pairs. Does what you’re doing with the pairs specifically require the pair format? (Or is that kind of thing not an issue on the data you’re working with and there’s no need to worry about malicious data?) – Ry- Jul 26 '20 at 11:23
  • @MarcoBonelli I guess you could say so. I wonder if the pandas `isin` method could be of any use here. – José Carlo Jul 26 '20 at 11:23
  • @Ry- It does require a pair format. I'm creating a network, actually. Using [graph-tool](https://graph-tool.skewed.de/) – José Carlo Jul 26 '20 at 11:27

1 Answer


Create an index from keywords to a set of all indices at which that keyword appears (I'm not super familiar with Pandas so you may need to fix some things):

from collections import defaultdict

# map each keyword to the set of row indices in which it appears
keyword_index = defaultdict(set)
for i, keywords in enumerate(df['keywords']):
    for keyword in keywords:
        keyword_index[keyword].add(i)
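
To make the structure concrete, running the same loop on some made-up toy data (three small rows instead of your dataframe column) gives:

from collections import defaultdict

rows = [['a', 'b'], ['b', 'c'], ['d']]  # made-up example rows
toy_index = defaultdict(set)
for i, keywords in enumerate(rows):
    for keyword in keywords:
        toy_index[keyword].add(i)
print(dict(toy_index))  # {'a': {0}, 'b': {0, 1}, 'c': {1}, 'd': {2}}

Here only 'b' maps to more than one row, so only rows 0 and 1 would be treated as sharing a keyword.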

Then loop through the index and do something for all keywords that appear at multiple indices:

for keyword, indices in keyword_index.items():
    if len(indices) >= 2:
        do_something()  # `indices` is the set of all rows that contain this keyword

You'll have to decide how to handle keywords that appear in more than two rows. If you want to handle every combination separately, it's still worst-case O(n^2) as in your original code.
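
Since the comments mention that you need the (i, j) pairs themselves (to build a graph), one way to expand the index into unique row pairs could look like this; it's only a sketch, with `do_something` standing in for whatever you do with each pair:

from itertools import combinations

seen = set()
for indices in keyword_index.values():
    # every unordered pair of rows that share this keyword
    for pair in combinations(sorted(indices), 2):
        if pair not in seen:  # rows sharing several keywords would otherwise be emitted repeatedly
            seen.add(pair)
            do_something(*pair)

In the worst case (every row shares a common keyword) `seen` grows to the same ~10^10 pairs as the brute force, so this only pays off when the overlap between rows is sparse.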

Thomas