4

I have the following list:

search_list = ['STEEL','IRON','GOLD','SILVER']

which I need to search within a dataframe (df):

      a    b             
0    123   'Blah Blah Steel'
1    456   'Blah Blah Blah'
2    789   'Blah Blah Gold'

and insert the matching rows into a new dataframe (newdf), adding a new column with the matching word from the list:

      a    b                   c
0    123   'Blah Blah Steel'   'STEEL'
1    789   'Blah Blah Gold'    'GOLD'

I can use the following code to extract the matching row:

newdf=df[df['b'].str.upper().str.contains('|'.join(search_list),na=False)]

but I can't figure out how to add the matching word from the list into column c.

I'm thinking that the match somehow needs to capture the index of the matching word in the list and then pull the value using the index number but I can't figure out how to do this.

Any help or pointers would be greatly appreciated

Thanks

Big_Daz
  • 141
  • 1
  • 7

7 Answers7

6

You could use extract and filter out those that are nan (i.e. no match):

search_list = ['STEEL','IRON','GOLD','SILVER']

df['c'] = df.b.str.extract('({0})'.format('|'.join(search_list)), flags=re.IGNORECASE)
result = df[~pd.isna(df.c)]

print(result)

Output

              a       b      c
123 'Blah  Blah  Steel'  Steel
789 'Blah  Blah   Gold'   Gold

Note that you have to import the re module in order to use the re.IGNORECASE flag. As an alternative you could use 2 directly that is the value of the re.IGNORECASE flag.

UPDATE

As mentioned by @user3483203 you can save the import by using:

df['c'] = df.b.str.extract('(?i)({0})'.format('|'.join(search_list)))
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • 3
    You don't need the extra import, `df.b.str.extract('(?i)({0})'.format('|'.join(search_list)))` works just fine – user3483203 Mar 06 '19 at 17:19
  • Brilliant. Thanks – Big_Daz Mar 07 '19 at 08:57
  • If the first record said, say, 'Blah Blah Steel Gold', how might you report both of those words in column c? I am currently having this issue where I am hitting multiple words in my search list and would like to record all of them, preferably separated by a comma. – CJJ Sep 30 '20 at 21:14
5

You can use set.intersection to find which words appear in column b:

search_list = set(['STEEL','IRON','GOLD','SILVER'])
df['c'] = df['b'].apply(lambda x: set.intersection(set(x.upper().split(' ')), search_list))

Output:

     a                b        c
0  123  Blah Blah Steel  {STEEL}
1  456   Blah Blah Blah       {}
2  789   Blah Blah Gold   {GOLD}

Use df[df['c'].astype(bool)] if you want to get rid of rows without a match

     a                b        c
0  123  Blah Blah Steel  {STEEL}
2  789   Blah Blah Gold   {GOLD}
user3483203
  • 50,081
  • 9
  • 65
  • 94
busybear
  • 10,194
  • 1
  • 25
  • 42
2

One way to do it is

def get_word(my_string):
    for word in search_list:
         if word.lower() in my_string.lower():
               return word
    return None

new_df["c"]= new_df["b"].apply(get_word)

You can also do something along the lines of

new_df["c"]= new_df["b"].apply(lambda my_string: [word for word in search_list if word.lower() in my_string.lower()][0])

With the first one, you have the option of adding column c to df first, and then filtering out the Nones, while the second one will throw an error if b doesn't contain any of the words.

You can also see this question: Get the first item from an iterable that matches a condition

Applying the method from the top rated answer would give

new_df["c"]= new_df["b"].apply(lambda my_string: next(word for word in search_list if word.lower() in my_string.lower())
Acccumulation
  • 3,491
  • 1
  • 8
  • 12
2

You can also do:

import pandas as pd

search_list = ('STEEL','IRON','GOLD','SILVER')

df = pd.DataFrame({'a':[123,456,789],'b':['blah blah Steel','blah blah blah','blah blah Gold']})

df.assign(c = df['b'].apply(lambda x: [j for j in x.split() if j.upper() in search_list]))

Updated for speed


import pandas as pd

search_list = set(['STEEL','IRON','GOLD','SILVER'])

df = pd.DataFrame({'a':[123,456,789],'b':['blah blah Steel','blah blah blah','blah blah Gold']})

df.assign(c = lambda d: d['b'].str.upper().str.split().map(lambda x: set(x).intersection(search_list)))

Results: enter image description here

Prayson W. Daniel
  • 14,191
  • 4
  • 51
  • 57
  • Using this answer, I am getting an error "'float' object has no attribute 'split'"-Any idea why this might be? – CJJ Oct 06 '20 at 14:17
  • Substitute `x.split` with `str(x).split` – Prayson W. Daniel Oct 06 '20 at 16:47
  • When I make that substitution, I now get the error TypeError: 'builtin_function_or_method' object is not iterable. – CJJ Oct 06 '20 at 17:24
  • What data type is your column data and is there missing values? – Prayson W. Daniel Oct 06 '20 at 17:31
  • There is one blank value which gets automatically populated with nan upon import. Should I populate it with something else? When I do df.dtypes, column 'b' says 'object'. The only manipulation done to that column is str.upper() upon import. – CJJ Oct 06 '20 at 17:42
  • Can you do `df['b'].fillna('', inplace=True)` before applying the function? – Prayson W. Daniel Oct 06 '20 at 18:05
  • I know longer get an error using df['b'].fillna('', inplace=True), however, 'c' is not populating at all even though there are words in my search list in 'b'. Could it be because some of the elements in my search list are phrases rather than just single words? – CJJ Oct 06 '20 at 18:19
1

You can use:

search_list = ['STEEL','IRON','GOLD','SILVER']
pat = r'\b|\b'.join(search_list)
pat2 = r'({})'.format('|'.join(search_list))

df_new= df.loc[df.b.str.contains(pat,case=False,na=False)].reset_index(drop=True)
df_new['new_col']=df_new.b.str.upper().str.extract(pat2)
print(df_new)

     a                  b new_col
0  123  'Blah Blah Steel'   STEEL
1  789   'Blah Blah Gold'    GOLD
anky
  • 74,114
  • 11
  • 41
  • 70
1

Using

s=pd.DataFrame(df.b.str.upper().str.strip("'").str.split(' ').tolist())
s.where(s.isin(search_list),'').sum(1)
Out[492]: 
0    STEEL
1         
2     GOLD
dtype: object
df['New']=s.where(s.isin(search_list),'').sum(1)
df
Out[494]: 
     a                  b    New
0  123  'Blah Blah Steel'  STEEL
1  456   'Blah Blah Blah'       
2  789   'Blah Blah Gold'   GOLD
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Here, the solution with final result like your display:

search_list = ['STEEL','IRON','GOLD','SILVER']

def process(x):
    for s in search_list:
        if s in x['b'].upper(): print("'"+ s +"'");return "'"+ s +"'"
    return ''

df['c']= df.apply(lambda x: process(x),axis=1)
df = df.drop(df[df['c'] == ''].index).reset_index(drop=True)

print(df)

output:

     a                 b        c
0  123  'Blah Blah Steel  'STEEL'
1  789  'Blah Blah Gold'   'GOLD'
Frenchy
  • 16,386
  • 3
  • 16
  • 39
  • Would there be a way to modify this function so that is there are multiple words from search_list in column b, all of them would be returned in column c, say comma separated? – CJJ Oct 06 '20 at 19:09