
I have two dataframes named codes and phrases.

codes:

code  keywords
bg    burger
bg    burgers
cbg   chicken burger
cbg   burger chicken
cbg   chicken burgers
--    --
--    --

phrases:

text
burgers near me
chicken burgers around NYC
--
--

Using Python, I want to build a dataframe like this:

text                        code
burgers near me             bg
chicken burgers around NYC  cbg
--                          --
--                          --
-- --

I am trying to identify which keyword from codes best matches each record in phrases.

If I simply use a string-contains check, the keyword burgers matches both phrases above, so the second phrase would be tagged bg as well as cbg. Is there a better way to accomplish this?
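
To make the ambiguity concrete, here is a minimal sketch of the naive approach. It assumes only the sample rows shown above; the name `naive` is made up for illustration.

```python
import pandas as pd

# Sample data copied from the tables above.
codes = pd.DataFrame({
    "code": ["bg", "bg", "cbg", "cbg", "cbg"],
    "keywords": ["burger", "burgers", "chicken burger",
                 "burger chicken", "chicken burgers"],
})
phrases = pd.DataFrame({"text": ["burgers near me", "chicken burgers around NYC"]})

# For each phrase, collect every code whose keyword occurs as a plain substring.
naive = {
    text: sorted({c for c, k in zip(codes["code"], codes["keywords"]) if k in text})
    for text in phrases["text"]
}

for text, matched in naive.items():
    print(text, "->", matched)
```

With this check the second phrase picks up both codes, which is the problem I want to avoid.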

Thanks in advance!

1 Answer


You can add a column to codes holding the length of each keyword, then assign codes starting with the longest keyword, so the most specific match wins. On each iteration, compute one boolean mask for the phrases whose code is still blank and another for the phrases that match the current keyword, so that only rows satisfying both are filled.

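import re

phrases['code'] = ''
# Longest keywords first, so a more specific keyword claims a phrase before a shorter one.
codes['Length'] = codes.keywords.str.len()
codes = codes.sort_values('Length', ascending=False)

for _, row in codes.iterrows():
    ix_blank = phrases.code.eq('')  # phrases that have no code yet
    # Whole-word match; re.escape guards against regex metacharacters in a keyword.
    ix_match = phrases.text.str.contains(f'\\b{re.escape(row.keywords)}\\b')
    phrases.loc[ix_blank & ix_match, 'code'] = row.code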
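
For a quick check, the same longest-keyword-first idea can be run end to end on the sample rows from the question. This is only a sketch under those assumptions: the helper name `assign_codes` and the temporary `length` column are made up for illustration, and it returns a copy instead of modifying phrases in place.

```python
import re

import pandas as pd

# Sample data copied from the question.
codes = pd.DataFrame({
    "code": ["bg", "bg", "cbg", "cbg", "cbg"],
    "keywords": ["burger", "burgers", "chicken burger",
                 "burger chicken", "chicken burgers"],
})
phrases = pd.DataFrame({"text": ["burgers near me", "chicken burgers around NYC"]})


def assign_codes(phrases, codes):
    """Hypothetical helper: fill a 'code' column, trying the longest keywords first."""
    out = phrases.copy()
    out["code"] = ""
    ordered = (
        codes.assign(length=codes["keywords"].str.len())
             .sort_values("length", ascending=False)
    )
    for keyword, code in zip(ordered["keywords"], ordered["code"]):
        blank = out["code"].eq("")  # rows still waiting for a code
        # \b ... \b restricts the match to whole words.
        match = out["text"].str.contains(rf"\b{re.escape(keyword)}\b")
        out.loc[blank & match, "code"] = code
    return out


result = assign_codes(phrases, codes)
print(result)
```

Because "chicken burgers" is tried before "burgers", the second phrase is already filled by the time the shorter keyword comes up, and only the first phrase falls through to bg.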
James