How to use str.contains on a column of one DataFrame to filter by the keywords in a column of another (index) DataFrame

Question

I have two dataframes, a main df and an index df.
For each row of the main df, I want to use the 'contains' function to find which keyword from the index df appears in the product column.
In the end, the main df should have a new column, keyword, so that main_df[keyword]=[C2,VA,E220F,7350M].

main df is

        data    num           product
 0  2019-10-01  39013000    xxxxxC2xxxxxxx
 1  2019-10-01  39013000    xxxxxxVAxxxxxxxxxxxx
 2  2019-10-28  39013000    xxxxxxxxE220Fxxxxxxxxxxxxx
 3  2019-12-31  39013000    xxxxxxxx7350Mxxxxxxxx

index df is

    product
0   VA
1   C2
2   7350M
3   E220F

My code is:

for key in key_word:
    mask = df_import_tmp0["product"].str.contains(key) 
df_import_tmp0['keyword']=key

The output is not what I want:

df_import_tmp0

    data         num            product                 keyword
0   2019-10-01  39013000    xxxxxC2xxxxxxx              E220F
1   2019-10-01  39013000    xxxxxxVAxxxxxxxxxxxx        E220F
2   2019-10-28  39013000    xxxxxxxxE220Fxxxxxxxxxxxxx  E220F
3   2019-12-31  39013000    xxxxxxxx7350Mxxxxxxxx       E220F

Does this answer your question? [Python Pandas - Merge based on substring in string](https://stackoverflow.com/questions/48743662/python-pandas-merge-based-on-substring-in-string) — fsimonjetz, Jul 27 '22 at 01:15

score 1 · Answer 1 · answered Jul 27 '22 at 01:12

Here's a way to do what you're asking:

df['keyword'] = df['product'].str.extract('(' + '|'.join(idx['product'].tolist()) + ')')

Input:

df:

         data       num                     product
0  2019-10-01  39013000              xxxxxC2xxxxxxx
1  2019-10-01  39013000        xxxxxxVAxxxxxxxxxxxx
2  2019-10-28  39013000  xxxxxxxxE220Fxxxxxxxxxxxxx
3  2019-12-31  39013000       xxxxxxxx7350Mxxxxxxxx

idx:

  product
0      VA
1      C2
2   7350M
3   E220F

Output:

         data       num                     product keyword
0  2019-10-01  39013000              xxxxxC2xxxxxxx      C2
1  2019-10-01  39013000        xxxxxxVAxxxxxxxxxxxx      VA
2  2019-10-28  39013000  xxxxxxxxE220Fxxxxxxxxxxxxx   E220F
3  2019-12-31  39013000       xxxxxxxx7350Mxxxxxxxx   7350M
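
The one-liner above joins the keywords with `|` into a single alternation, and `str.extract` returns the first match in each row. Below is a minimal, self-contained sketch of the same idea with the frames rebuilt from the question's data. Two things in it are assumptions rather than part of the answer: the keywords are passed through `re.escape` in case they contain regex metacharacters, and they are sorted longest first so that a longer keyword is tried before a shorter one starting at the same position.

```python
import re
import pandas as pd

# Frames rebuilt from the data shown in the question.
df = pd.DataFrame({
    "data": ["2019-10-01", "2019-10-01", "2019-10-28", "2019-12-31"],
    "num": [39013000] * 4,
    "product": ["xxxxxC2xxxxxxx", "xxxxxxVAxxxxxxxxxxxx",
                "xxxxxxxxE220Fxxxxxxxxxxxxx", "xxxxxxxx7350Mxxxxxxxx"],
})
idx = pd.DataFrame({"product": ["VA", "C2", "7350M", "E220F"]})

# Assumption: keywords should match literally, so escape regex metacharacters.
# Longest-first ordering means that, at the same position, a longer keyword
# is tried before a shorter one that is its prefix.
keywords = sorted(idx["product"].astype(str), key=len, reverse=True)
pattern = "(" + "|".join(re.escape(k) for k in keywords) + ")"

# expand=False returns a Series; rows without any match get NaN.
df["keyword"] = df["product"].str.extract(pattern, expand=False)
print(df)
```

If a row can contain more than one keyword, only the first match is kept by this approach.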

score 0 · Answer 2 · Chris Seeling · 2022-07-27T03:20:13.930

A simple way:

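import numpy as np
import pandas as pd

# product_list is assumed to hold the product strings from the question
df = pd.DataFrame({"product": product_list})
key_words = ['VA', 'C2', '7350M', 'E220F']
df["keyword"] = ''
for kw in key_words:
    # Treat NaN as '' so the mask is purely boolean, then assign through .loc
    mask = df["product"].replace(np.nan, '').str.contains(kw)
    df.loc[mask, "keyword"] = kw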

Gives:

    product                     keyword
0   xxxxxC2xxxxxxx              C2
1   xxxxxxVAxxxxxxxxxxxx        VA
2   xxxxxxxxE220Fxxxxxxxxxxxxx  E220F
3   xxxxxxxx7350Mxxxxxxxx       7350M

edited Jul 27 '22 at 03:20

answered Jul 27 '22 at 01:18

It works! But I also want to ask: if the product is empty, running this code raises the error "Cannot mask with non-boolean array containing NA / NaN values". How can I solve this error? Thanks. – Tony Lin Jul 27 '22 at 03:01
I edited the code above to deal with np.nan values – Chris Seeling Jul 27 '22 at 03:20
Thanks a lot, it works smoothly now! – Tony Lin Jul 27 '22 at 03:50
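
For the NaN case raised in the comments, another option is the `na` parameter of `str.contains`, which fills missing values in the mask directly. The sketch below is not from the answer; it assumes a small made-up product column with one missing entry, and it passes `regex=False` on the assumption that the keywords should be matched as plain substrings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the middle product is missing.
df = pd.DataFrame(
    {"product": ["xxxxxC2xxxxxxx", np.nan, "xxxxxxxx7350Mxxxxxxxx"]}
)
key_words = ["VA", "C2", "7350M", "E220F"]

df["keyword"] = ""
for kw in key_words:
    # na=False turns missing products into False, so the mask is boolean.
    # regex=False matches the keyword as a literal substring.
    mask = df["product"].str.contains(kw, na=False, regex=False)
    # Assign through .loc rather than chained indexing.
    df.loc[mask, "keyword"] = kw
print(df)
```

Rows whose product is missing simply keep the empty-string default in the keyword column.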
