1

I have a large dataframe with text that I want to use to find matches from a list of words (around 1k words in there).

I have managed to flag the presence or absence of a list word in each row, but I also need to know which word matched. Sometimes a row exactly matches more than one word from the list, and I would like to have them all.

I tried to use the code below, but it gives me partial matches - syllables instead of full words.

#this is a code to recreate the initial DF

import pandas as pd

df_data= [['orange','0'],
['apple and lemon','1'],
['lemon and orange','1']]

df = pd.DataFrame(df_data, columns=['text', 'match'])

Initial DF:

 text                 match
 orange               0
 apple and lemon      1
 lemon and orange     1

This is the list of words I need to match

 exactmatch = ['apple', 'lemon']

Expected result:

 text                    match  exact words
 orange                    0         0 
 apple and lemon           1        'apple','lemon'
 lemon and orange          1        'lemon'

This is what I've tried:

# for some rows it gives me words I want, 
#and for some it gives me parts of the word

#regex attempt 1, gives me partial matches (syllables or single letters)

pattern1 = '|'.join(exactmatch)
df['contains'] = df['text'].str.extract("(" + "|".join(exactmatch) 
+")", expand=False)

#regex attempt 2 - this gives me an error - unexpected EOL

df['contains'] = df['text'].str.extractall
("(" + "|".join(exactmatch) +")").unstack().apply(','.join, 1)

#TypeError: ('sequence item 1: expected str instance, float found', 
#'occurred at index 2')

#no regex attempt, does not give me matches if the word is in there

lst = list(df['text'])
match = []
for w in lst:
 if w in exactmatch:
    match.append(w)
    break
alinaz

2 Answers

5

Use str.findall

Ex:

exactmatch = ['apple', 'lemon']
df_data= [['orange'],['apple and lemon',],['lemon and orange'],]

df= pd.DataFrame(df_data,columns=['text'])
df['exact word'] = df["text"].str.findall(r"|".join(exactmatch)).apply(", ".join)
print(df)

Output:

               text    exact word
0            orange              
1   apple and lemon  apple, lemon
2  lemon and orange         lemon
Rakesh
  • Thanks! It works, but in addition to giving me full matches it also gives me syllable matches in a bigger dataset. E.g.: one of the matches looks like this "a, la, et, identify, la, are, la, ideology, ...". I need the words 'identify' and 'ideology' because they are in my list, but I'm not sure how to eliminate the partial matches (letter combinations). – alinaz Jul 22 '19 at 14:21
  • 1
    Looks like you need regex boundaries \b – Rakesh Jul 22 '19 at 14:23
  • thanks :) Could you please help me and show where I should put them? – alinaz Jul 22 '19 at 14:44
  • 1
    ex `str.findall(r"\b"+"|".join(exactmatch) + r"\b")` – Rakesh Jul 22 '19 at 15:24
  • @Rakesh seems like the regex boundaries still gave the same result like what alinaz mentioned – Jimmy Nov 02 '20 at 19:14
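
A possible reason the boundaries in the comment above still return partial matches: when `\b` is concatenated around an ungrouped alternation, it binds only to the first and last alternatives. The sketch below uses the standard `re` module with a made-up word list and sentence (both are assumptions for illustration); the same pattern strings could be passed to `Series.str.findall`.

```python
import re

# Hypothetical word list and text, chosen to mimic the partial matches reported above.
words = ["la", "et", "identify"]
text = "please identify the planet"

# Ungrouped: \b binds only to the first and last alternatives,
# so the pattern is effectively  \bla | et | identify\b
ungrouped = r"\b" + "|".join(words) + r"\b"

# Grouped: both boundaries apply to every alternative.
grouped = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

ungrouped_hits = re.findall(ungrouped, text)
grouped_hits = re.findall(grouped, text)

print(ungrouped_hits)  # ['identify', 'et']  ("et" comes from inside "planet")
print(grouped_hits)    # ['identify']
```

Wrapping the alternation in a group, capturing or not, is what makes the boundaries apply to each word.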
0

Matching words as "exact" words is not a simple regex task. The right solution depends on your concrete use case and on what "exact" means in each specific scenario.

You need to build a pattern dynamically from the list of words, using one of the approaches described in "Match a whole word in a string using dynamic regex" or "Word boundary with words starting or ending with special characters gives unexpected results".

Then, you can simply use Series.str.findall without worrying about whether your pattern contains a capturing group or not:

df = pd.DataFrame({'text':['orange','apple and lemon', 'lemon and orange'], 'match':['0','1','1']})
exactmatch = ['apple', 'lemon']
pattern = fr'\b({"|".join(exactmatch)})\b' # This works for words consisting of letters, digits or underscores
df['exact word'] = df['text'].str.findall(pattern).str.join(", ")
# => >>> df
# =>                text match    exact word
# => 0            orange     0              
# => 1   apple and lemon     1  apple, lemon
# => 2  lemon and orange     1         lemon
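
To illustrate the limitation noted in the pattern comment above, here is a minimal sketch of how `\b` behaves when a word ends in non-word characters. The word list and sentence are hypothetical, and plain `re` is used instead of pandas to keep the example self-contained.

```python
import re

# Hypothetical list: one plain word and one ending in non-word characters.
words = ["lemon", "c++"]
text = "I like c++ and lemon"

# Escape each word so "+" is treated literally.
alternation = "|".join(re.escape(w) for w in words)

# \b needs a word char on exactly one side, so it cannot match between "+" and " ".
word_boundary = rf"\b(?:{alternation})\b"
# Whitespace boundaries only require whitespace or string start/end around the match.
whitespace_boundary = rf"(?<!\S)(?:{alternation})(?!\S)"

wb_hits = re.findall(word_boundary, text)
ws_hits = re.findall(whitespace_boundary, text)

print(wb_hits)  # ['lemon']
print(ws_hits)  # ['c++', 'lemon']
```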

If you need to rely on exact match but not \b word boundaries (the patterns below use re.escape, so import re first):

  • Full string match: fr'^({"|".join([re.escape(word) for word in exactmatch])})\Z' (this is the weirdest case for .findall, Series.str.extract makes more sense, and even non-regex approaches must be considered here, like .isin)
  • Word boundaries with longest-match support, for when words can contain special chars inside them and terms can overlap (extract 'sour lemon' from 'I have a sour lemon' when the words are ['sour', 'lemon', 'sour lemon']; sorting longest first lets the longer alternative win): pattern = fr'\b({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})\b'
  • Whitespace boundaries (a match occurs between whitespaces, or between whitespace and the start/end of the string): pattern = fr'(?<!\S)({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?!\S)'
  • Unambiguous word boundaries (no match in between word chars, i.e. letters, digits or underscores): pattern = fr'(?<!\w)({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?!\w)'
  • Unambiguous word boundaries with underscores subtracted (no match in between letters or digits, but _lemon_ is a case of an exact lemon word): pattern = fr'(?<![^\W_])({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?![^\W_])'
  • Letter boundaries (no match in between letters, but _lemon_ and 0lemon1 are cases of an exact lemon word): pattern = fr'(?<![^\W\d_])({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?![^\W\d_])'
  • Adaptive dynamic word boundaries Type 1 (when you have no control over the words to match, and they can contain special chars anywhere, with no special context restriction for initial and trailing special chars): pattern = fr'(?:(?!\w)|\b(?=\w))({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?:(?<=\w)\b|(?<!\w))'
  • Adaptive dynamic word boundaries Type 2 (when you have no control over the words to match, and they can contain special chars anywhere, and if there are special chars at the start or end of the word, no other word char can appear right next to it): pattern = fr'(?:\B(?!\w)|\b(?=\w))({"|".join([re.escape(word) for word in sorted(exactmatch, key=len, reverse=True)])})(?:(?<=\w)\b|(?<!\w)\B)'
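
As a rough sketch of two of the cases listed above, the snippet below shows why the alternation is sorted longest first, and how the Type 1 adaptive boundaries behave next to punctuation. The sentences and the extra word "c++" are hypothetical, and plain `re` stands in for `Series.str.findall` so the example is self-contained.

```python
import re

# --- Longest-match support: order of alternatives matters ---
words = ["sour", "lemon", "sour lemon"]
text = "I have a sour lemon"

as_listed = "|".join(re.escape(w) for w in words)
longest_first = "|".join(
    re.escape(w) for w in sorted(words, key=len, reverse=True)
)

# The regex engine takes the first alternative that matches at a position,
# so "sour" wins over "sour lemon" unless the longer term comes first.
listed_hits = re.findall(rf"\b(?:{as_listed})\b", text)
sorted_hits = re.findall(rf"\b(?:{longest_first})\b", text)

print(listed_hits)  # ['sour', 'lemon']
print(sorted_hits)  # ['sour lemon']

# --- Adaptive dynamic word boundaries, Type 1 ---
words2 = ["lemon", "c++"]  # hypothetical list with a special-char word
text2 = "I like c++, lemons and lemon."

alternation2 = "|".join(
    re.escape(w) for w in sorted(words2, key=len, reverse=True)
)
adaptive = rf"(?:(?!\w)|\b(?=\w))(?:{alternation2})(?:(?<=\w)\b|(?<!\w))"

adaptive_hits = re.findall(adaptive, text2)
print(adaptive_hits)  # ['c++', 'lemon']  ("lemons" is not matched)
```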
Wiktor Stribiżew