0

I have referenced the following post and it was extremely helpful, but I need to take it a step further. Python - Searching a string within a dataframe from a list

I would like to not only search my data frame for a list of words, but also keep track of if multiple words are found and the frequency. So, using the example from the above post:

If this is my search list

search_list = ['STEEL','IRON','GOLD','SILVER']

and this is the data frame I am searching in

      a    b             
0    123   'Blah Blah Steel'
1    456   'Blah Blah Blah Steel Gold'
2    789   'Blah Blah Gold'
3    790   'Blah Blah blah'

I want my output to be

      a    b                        c               d
0    123   'Blah Blah Steel'      'STEEL'           1
1    789   'Blah Blah Steel Gold' 'STEEL','GOLD'    2
2    789   'Blah Blah Gold'       'GOLD'            1
3    790   'Blah Blah blah'

How may I expand on the awesome solutions in the above mentioned post to get this desired output? I am currently utilizing the top voted answer as a starting place.

I am more concerned with being able to tag multiple words from the list. I have not found any way to do this yet. I can apply string counting functions to the data frame to create a frequency column if these is no way to do that in this step. If there is a way to do it all in one step though that would be good to know as well.

Thanks in advance!

CJJ
  • 75
  • 9

2 Answers2

2

You can use re.findall() instead of extract() to do what you need.

import re

search_list = ['STEEL','IRON','GOLD','SILVER']

df['c'] = df.b.str.findall('({0})'.format('|'.join(search_list)), flags=re.IGNORECASE)
df['d'] = df['c'].str.len()

This output looks like this:

enter image description here

Akanksha Atrey
  • 780
  • 4
  • 8
  • Great solution this works ! I had looked into re, but couldn't quite get it working. Thank you ! – CJJ Oct 01 '20 at 13:55
  • I am seeing some output that looks like this: [(' Steel', '', '', '', '', ''), ('Gold', '', '', '', '', '')]. It does not appear to be consistent so I can't quite figure out the cause. Any idea why this may be happening? – CJJ Oct 05 '20 at 19:01
  • That is weird. Do you have any examples of that (i.e. what is a, b, and search_list when you see c=[(' Steel', '', '', '', '', ''), ('Gold', '', '', '', '', '')])? – Akanksha Atrey Oct 05 '20 at 21:22
  • @Akanksha-Is there a character limit when using findall(). I am noticing some of the words in my search list not being flagged, and then tend to be longer in length. Not sure if this is just coincidence. Does some of the terms in my search list actually being phrases of multiple words have any effect on the ability findall() to find them in my dataframe? – CJJ Oct 06 '20 at 21:08
  • 1
    Interesting. I don't think there is a character limit. From a quick search, it seems like this is because the keywords are defined as capturing groups. It will return empty matches for the non-matching expressions. Check out https://stackoverflow.com/questions/55513299/python-regex-findall-returns-empty-string-when-not-asked and https://stackoverflow.com/questions/45876194/python-regex-or-gives-empty-string-when-using-findall. I would recommend opening up a new question for it if the recommended solutions don't work. – Akanksha Atrey Oct 07 '20 at 02:55
  • 1
    In case you were curious-Turns out the issue was that some words/phrases in the search list had parentheses. Not sure why parentheses throws it off, but other special characters such as hyphens do not or why the issue was not consistent, but removing the parentheses solved the issue. – CJJ Oct 15 '20 at 14:09
  • Awesome, glad you got it sorted! – Akanksha Atrey Oct 15 '20 at 18:04
1
#turn column b into a list of uppercases
  df.b=df.b.str.upper().str.split('\s')

#Because you have two lists, use the apply function to turn them into sets
#..and leverage the rich membership functions encased in sets.
# Using intersection, you will find items in each list. 
#Then use list.str.len() to count.

df=df.assign(c=df.b.apply(lambda x:[*{*x}&{*search_list}])\
.str.join(','),d=df.b.apply(lambda \
x:[*{*x}&{*search_list}]).str.len())



                       b           c      d
0        [BLAH, BLAH, STEEL]       STEEL  1
1  [BLAH, BLAH, STEEL, GOLD]  GOLD,STEEL  2
2         [BLAH, BLAH, GOLD]        GOLD  1
3         [BLAH, BLAH, BLAH]              0
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • Is it possible to utilize/modify this solution so that column b is not broken up into individual words? I ask because some of the words in my search list are actually phrases and so those would not get captured in that case. – CJJ Oct 06 '20 at 00:58