
I have 10 million records with two columns, keyword and string.

I want to find the rows where the value in the keyword column appears in the string column of the same row:

import pandas as pd

test_df = pd.DataFrame({'keyword1': ['day','night','internet','day','night','internet'], 'string1': ['today is a good day','I like this','youtube','sunday','what is this','internet']})
test_df

   keyword1              string1
0       day  today is a good day
1     night          I like this
2  internet              youtube
3       day               sunday
4     night         what is this
5  internet             internet

My first attempt was to use .apply, but it's slow:

test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]

There are 10 million different strings but a much smaller number of keywords (on the order of ten thousand), so I'm thinking it might be more efficient to group by keyword.

test_df.groupby('keyword1', group_keys=False).apply(
    lambda x: x[x['string1'].str.contains(x.loc[x.index[0], 'keyword1'])])

Supposedly, this approach does only about 10k iterations instead of 10 million, but it is only slightly faster (about 10%). I'm not sure why: is the per-row overhead of .apply small to begin with, or does groupby add its own cost?

My question is: is there a better way to perform this job?

  • Please refer to this link, maybe it can help: https://stackoverflow.com/questions/40861341/extracting-sentences-using-pandas-with-specific-words – Balkar Singh Aug 19 '20 at 06:23
  • Thanks, it's a very interesting post, but I think my use case is a little simpler than that; right now I'm not thinking about using NLP for this. – Yuan Ren Aug 19 '20 at 10:01

1 Answer


One idea is to create the mask with GroupBy.transform and compare against x.name; regex=False should also improve performance. But there still seem to be a lot of groups (10k), so the groupby is the bottleneck here:

mask = (test_df.groupby('keyword1')['string1']
               .transform(lambda x : x.str.contains(x.name, regex=False)))

df = test_df[mask]
print (df)
   keyword1              string1
0       day  today is a good day
3       day               sunday
5  internet             internet

Another idea is to use a list comprehension, but I'm not sure if it is faster with 10M rows:

test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]

Some tests with the sample data, but there are only a few groups here, so groupby is really fast:

#6k data
test_df = pd.concat([test_df] * 1000, ignore_index=True)

In [49]: %timeit test_df[test_df.groupby('keyword1', sort=False)['string1'].transform(lambda x :x.str.contains(x.name, regex=False))]
5.84 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [50]: %timeit test_df[[x in y for x, y in test_df[['keyword1','string1']].to_numpy()]]
9.46 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [51]: %timeit test_df.groupby('keyword1',group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0],'keyword1'])])
11.7 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [52]: %timeit test_df[test_df.apply(lambda x: True if x['keyword1'] in x['string1'] else False,axis=1)]
138 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Thanks. So it seems groupby.transform is faster than groupby.apply. In my actual application I have two keyword columns, and containing either of them counts as a match, so my final code looks like this: df[df.groupby(['keyword1','keyword2'])['string'].transform(lambda x:x.str.contains(x.name[0],regex=False) | x.str.contains(x.name[1],regex=False))]. I compared it to the original methods on 10 million rows: 183s vs 220s and 312s. A meaningful improvement, but it still takes about 3 minutes, probably because there are a lot of groups. – Yuan Ren Aug 19 '20 at 10:10
  • @YuanRen - I agree, there are still a lot of groups. – jezrael Aug 19 '20 at 10:12
  • @YuanRen - there should be one idea - try [Aho-Corasick algorithm](https://stackoverflow.com/a/48600345) – jezrael Aug 19 '20 at 10:26
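
For completeness, here is a minimal sketch of the Aho-Corasick idea from the last comment. It assumes the pyahocorasick package (pip install pyahocorasick) and the column names from the question; automaton and keyword_in_string are names introduced only for this illustration, not part of the original posts. The automaton is built once over all unique keywords, each string is scanned in a single pass, and a row is kept when its own keyword is among the matches:

import ahocorasick  # from the pyahocorasick package
import pandas as pd

test_df = pd.DataFrame({'keyword1': ['day','night','internet','day','night','internet'], 'string1': ['today is a good day','I like this','youtube','sunday','what is this','internet']})

# Build one automaton over all unique keywords (done once; ~10k words in the real data).
automaton = ahocorasick.Automaton()
for kw in test_df['keyword1'].unique():
    automaton.add_word(kw, kw)
automaton.make_automaton()

# Scan each string once, collect every keyword it contains,
# and keep the row if its own keyword is among the matches.
def keyword_in_string(keyword, string):
    return keyword in {found for _, found in automaton.iter(string)}

mask = [keyword_in_string(k, s) for k, s in zip(test_df['keyword1'], test_df['string1'])]
print(test_df[mask])

Whether this beats the grouped str.contains depends on the data; the single-pass scan is most likely to pay off when several keyword columns have to be checked against the same string, as in the two-keyword case discussed in the comments above.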