Different counting problem by upper/lower case in list matching

Question

    Name text
0   K    IeatApple
1   Y    bananaisdelicious
2   B    orangelikesomething 
3   Q    blueBanana
4   C    appleislike

I want to match the 'text' column of the data frame against a list of keywords.

However, uppercase and lowercase letters are used inconsistently in the 'text' column. So, to capture every occurrence, I turned the list into a case-insensitive regex as follows.

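import re
import pandas as pd

mylist = ['apple', 'banana']
mylist = [re.escape(k) for k in mylist]

# find the keywords from the list in each row of the column
# (?i) goes at the start of the pattern; Python 3.11+ rejects it mid-pattern
extracted = df['text'].str.findall(f'(?i)({"|".join(mylist)})').apply(set)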

#Matched words are added to the data frame as column.
df['matching'] = extracted.str.join(',')

#keyword counting
s = pd.DataFrame(extracted.tolist()).stack().value_counts()
print(s)

Apple 1
Banana 1
banana 1
apple 1

One problem with doing this is that it treats 'apple' and 'Apple' as different words.

Is there a way to match both uppercase and lowercase letters and count them as the same word?

Untested: `extracted = df['text'].str.findall(f'({"|".join(mylist)})', regex=True, case=False).apply(set)`. This probably has a dup question somewhere on SO though =) — JvdV, Jul 21 '20 at 07:32
@JvdV got an error : `str_findall() got an unexpected keyword argument 'regex'` — ybin, Jul 21 '20 at 07:39
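
The error reported above comes from the signature of `Series.str.findall`, which accepts only `pat` and `flags`; it has no `regex` or `case` parameter. A minimal sketch, assuming the sample strings from the question, passes `re.IGNORECASE` through `flags` and then lowercases each match so that 'Apple' and 'apple' are counted together. The names `texts`, `keywords`, `matches`, `normalized`, and `counts` are illustrative only.

```python
import re

import pandas as pd

# Sample strings copied from the question's 'text' column.
texts = pd.Series([
    "IeatApple",
    "bananaisdelicious",
    "orangelikesomething",
    "blueBanana",
    "appleislike",
])

keywords = ["apple", "banana"]
pattern = "|".join(re.escape(k) for k in keywords)

# Case-insensitive search; matches keep the casing found in the text.
matches = texts.str.findall(pattern, flags=re.IGNORECASE)

# Lowercase and de-duplicate the matches per row so both spellings merge.
normalized = matches.apply(lambda found: sorted({m.lower() for m in found}))

# Rows without a match explode to NaN, which value_counts() ignores.
counts = normalized.explode().value_counts()
print(counts)
```

This leaves the original text untouched, which may matter if the column is reused elsewhere.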

score 2 · Accepted Answer · answered Jul 21 '20 at 07:32

One idea is to convert both the keywords and the column values to lowercase:

mylist = ['apple', 'banana']
# no (?i) flag needed: the keywords and the text are both lowercased
mylist = [re.escape(k.lower()) for k in mylist]

extracted = df['text'].str.lower().str.findall(f'({"|".join(mylist)})').apply(set)
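
A self-contained version of this idea might look like the following sketch. It rebuilds the sample frame from the question (with the column named `text`, as the code assumes), leaves out the inline `(?i)` because both sides are already lowercased, and sorts each row's matches into a list so the joined column is deterministic. The names `keywords`, `pattern`, and `counts` are illustrative and do not come from the original post.

```python
import re

import pandas as pd

# Sample frame mirroring the question.
df = pd.DataFrame({
    "Name": ["K", "Y", "B", "Q", "C"],
    "text": [
        "IeatApple",
        "bananaisdelicious",
        "orangelikesomething",
        "blueBanana",
        "appleislike",
    ],
})

keywords = ["apple", "banana"]
# Escape and lowercase each keyword, then join them into one alternation.
pattern = "|".join(re.escape(k.lower()) for k in keywords)

# Lowercase the text first so every match comes back in a single spelling,
# then keep each keyword at most once per row.
extracted = (
    df["text"].str.lower().str.findall(pattern).apply(lambda m: sorted(set(m)))
)

# Matched keywords per row, comma-separated (empty string when none match).
df["matching"] = extracted.apply(",".join)

# Number of rows in which each keyword appears.
counts = extracted.explode().value_counts()
print(counts)
```

With the sample data, 'apple' and 'banana' are each counted once per row in which they appear, regardless of the casing in the text.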

answered Jul 21 '20 at 07:32

jezrael


Ah, so the idea is to lowercase both the 'text' column and the list? – ybin Jul 21 '20 at 07:39
@ybin - edited answer, first 2 rows. – jezrael Jul 21 '20 at 07:39
My real data mixes Korean with English. Does that affect whether this works? I'll give it a try! – ybin Jul 21 '20 at 07:43
@Ch3steR - it matches, but the output is a combination of lowercase and uppercase. – jezrael Jul 21 '20 at 07:51
@ybin - I guess it should work with all strings, but it is best to test it. – jezrael Jul 21 '20 at 07:52
Ahh yes, did not notice that. +1 – Ch3steR Jul 21 '20 at 07:53
It works well, thanks always, Joži. – ybin Jul 21 '20 at 08:45
