There are 10 million records with two columns, keyword and string. I want to find the rows where the value in the keyword column appears as a substring of the string column in the same row:
import pandas as pd

test_df = pd.DataFrame({
    'keyword1': ['day', 'night', 'internet', 'day', 'night', 'internet'],
    'string1': ['today is a good day', 'I like this', 'youtube', 'sunday', 'what is this', 'internet'],
})
test_df
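On this toy frame, the rows that should survive the filter are 0, 3, and 5 ('day' is contained in 'today is a good day' and in 'sunday', and 'internet' equals 'internet'), so the expected result looks roughly like:

   keyword1              string1
0       day  today is a good day
3       day               sunday
5  internet             internet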
My first attempt was to use .apply, but it is slow.
test_df[test_df.apply(lambda x: x['keyword1'] in x['string1'], axis=1)]
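(For comparison, a pattern that is often faster than DataFrame.apply for a purely row-wise check like this is a plain list comprehension over the two columns, since it skips building a Series object per row; a minimal sketch using the same column names:

# Row-wise substring check without .apply: iterate the two columns directly.
mask = [kw in s for kw, s in zip(test_df['keyword1'], test_df['string1'])]
test_df[mask]

This still visits every row in Python, but with less per-row overhead.)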
Because there are 10 million distinct strings but a much smaller number of keywords (on the order of 10 thousand), I'm thinking it might be more efficient to group by keyword.
test_df.groupby('keyword1', group_keys=False).apply(lambda x: x[x['string1'].str.contains(x.loc[x.index[0], 'keyword1'])])
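(Two small notes on this variant: inside groupby(...).apply the group key is also available as x.name, and passing regex=False to str.contains keeps the literal-substring semantics of the in check, since the default interprets the keyword as a regular expression. A roughly equivalent sketch:

# Same grouping idea: one vectorized str.contains call per keyword group.
test_df.groupby('keyword1', group_keys=False).apply(
    lambda x: x[x['string1'].str.contains(x.name, regex=False)]
)
)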
Supposedly, this approach needs only about 10 thousand iterations rather than 10 million, but it is only slightly faster (roughly 10%), and I'm not sure why: is the per-row overhead of .apply actually small, or does groupby add its own cost?
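One way to see where the time goes is to time both variants directly on the real data (or a large synthetic sample); a minimal sketch, assuming test_df and the column names above:

import time

def time_it(label, fn):
    # Crude wall-clock timing; %timeit in IPython gives more stable numbers.
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.3f}s')

time_it('apply', lambda: test_df[test_df.apply(
    lambda x: x['keyword1'] in x['string1'], axis=1)])
time_it('groupby', lambda: test_df.groupby('keyword1', group_keys=False).apply(
    lambda x: x[x['string1'].str.contains(x.name, regex=False)]))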
My question is: is there a better way to do this?