    Name Text
0   K    IeatApple
1   Y    bananaisdelicious
2   B    orangelikesomething 
3   Q    blueBanana
4   C    appleislike

I want to match the values in the 'text' column of the data frame against a list of keywords.

However, the 'text' column mixes lowercase and uppercase letters. To capture all of them, I turned the list into a case-insensitive regex as follows.

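import re
import pandas as pd

mylist = ["apple", "banana"]
# Escape each keyword; the (?i) flag goes once, at the start of the combined pattern
mylist = [re.escape(k) for k in mylist]
# Find the keywords from the list in the column
extracted = df['text'].str.findall(f'(?i)({"|".join(mylist)})').apply(set)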

#Matched words are added to the data frame as column.
df['matching'] = extracted.str.join(',')

#keyword counting
s = pd.DataFrame(extracted.tolist()).stack().value_counts()
print(s)

Apple 1
Banana 1
banana 1
apple 1

One problem with this approach is that it treats 'apple' and 'Apple' as different words.

Is there a way to match both uppercase and lowercase letters and count them as the same word?

ybin
  • Untested: `extracted = df['text'].str.findall(f'({"|".join(mylist)})', regex=True, case=False).apply(set)`. This probably has a dup question somewhere on SO though =) – JvdV Jul 21 '20 at 07:32
  • @JvdV got an error : `str_findall() got an unexpected keyword argument 'regex'` – ybin Jul 21 '20 at 07:39
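
The error reported above is consistent with `Series.str.findall` taking only a pattern and a `flags` argument. The following is a minimal sketch of a case-insensitive call using `flags=re.IGNORECASE`; the sample frame is a hypothetical stand-in for the one in the question, and `keywords` and `pattern` are illustrative names. It also shows why the counts split: the matches keep the casing of the original text.

```python
import re
import pandas as pd

# Hypothetical sample data mirroring the frame in the question.
df = pd.DataFrame({
    "Name": ["K", "Y", "B", "Q", "C"],
    "text": ["IeatApple", "bananaisdelicious", "orangelikesomething",
             "blueBanana", "appleislike"],
})

keywords = ["apple", "banana"]
pattern = "|".join(re.escape(k) for k in keywords)

# str.findall takes `flags`; it has no `regex` or `case` keyword.
matches = df["text"].str.findall(pattern, flags=re.IGNORECASE)

# Each match is returned exactly as it appears in the text,
# so 'Apple' and 'apple' remain distinct strings.
print(matches.tolist())
```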

1 Answer

One idea is to convert the values to lowercase:

mylist = ["apple", "banana"]
# With both the keywords and the column lowercased, no case-insensitive flag is needed
mylist = [re.escape(k.lower()) for k in mylist]

extracted = df['text'].str.lower().str.findall(f'({"|".join(mylist)})').apply(set)
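
For reference, here is a self-contained sketch of the lowercase approach from end to end, including the joined column and the keyword counts from the question. The sample frame is an assumption rebuilt from the table in the question (with the column named `text`, as in the question's code), and `explode` is swapped in for the `stack` step as one possible way to count.

```python
import re
import pandas as pd

# Hypothetical sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Name": ["K", "Y", "B", "Q", "C"],
    "text": ["IeatApple", "bananaisdelicious", "orangelikesomething",
             "blueBanana", "appleislike"],
})

mylist = ["apple", "banana"]
pattern = "|".join(re.escape(k.lower()) for k in mylist)

# Lowercase the column first so every match comes back in lowercase.
extracted = df["text"].str.lower().str.findall(pattern).apply(set)

# Join the unique matches per row; sorting keeps the order stable.
df["matching"] = extracted.apply(lambda found: ",".join(sorted(found)))

# Count each keyword once per row; rows without a match become NaN and are dropped.
s = extracted.apply(sorted).explode().value_counts()
print(s)
```

Note that lowercasing only affects the matching step here; the original `text` column is left untouched.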
jezrael