0

I need to merge the following df1 and df2, based on condition: if address in df1 contains state in df2.

df1:

                                                                           address  \
0      Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Frederick Nebraska 20620 (372) 587-2335   
2    Celeste Slater 606-3727 Ullamcorper. Street Roseville NH 11523 (786) 713-8616   
3            Theodore Lowe Ap #867-859 Sit Rd. Azusa New York 39531 (793) 151-6230   
4                 Calista Wise 7292 Dictum Av. San Antonio MI 47096 (492) 709-6392   

   quantity  price  
0         2     20  
1         3     13  
2         5     23  
3         3     32  
4         5     45  

df2:

   id        state
0   1  Mississippi
1   2     Nebraska
2   3     New York

My expected output will like this. How could I do that? Thank you.

                                                                           address  \
0      Cecilia Chapman 711-2880 Nulla St. Mankato Mississippi 96522 (257) 563-7401   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Frederick Nebraska 20620 (372) 587-2335   
2    Celeste Slater 606-3727 Ullamcorper. Street Roseville NH 11523 (786) 713-8616   
3            Theodore Lowe Ap #867-859 Sit Rd. Azusa New York 39531 (793) 151-6230   
4                 Calista Wise 7292 Dictum Av. San Antonio MI 47096 (492) 709-6392   

   quantity  price   id        state  
0         2     20  1.0  Mississippi  
1         3     13  2.0     Nebraska  
2         5     23  NaN          NaN  
3         3     32  3.0     New York  
4         5     45  NaN          NaN  

Update: the output of pat = '|'.join(r"\b{}\b".format(x) for x in df2['state']); print(df1['address'].str.extract('('+ pat + ')', expand=False))

      0    1    2    3    4    5    6    7    8    9  ...    40   41   42  \
0    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
1    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
2    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
3    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN    
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ... ...   ...  ...  ...  
158  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN   
159  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN ...   NaN  NaN  NaN  
ah bon
  • 9,293
  • 12
  • 65
  • 148

1 Answers1

1

You can extract all possible states by Series.str.extract with \b\b for words boundaries to new column and then merge with left join:

pat = '|'.join(r"\b{}\b".format(x) for x in df2['state'])
df1['state']= df1['address'].str.extract('('+ pat + ')', expand=False)
print (df1)
                                             address  quantity  price  \
0  Cecilia Chapman 711-2880 Nulla St. Mankato Mis...         2     20   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Freder...         3     13   
2  Celeste Slater 606-3727 Ullamcorper. Street Ro...         5     23   
3  Theodore Lowe Ap #867-859 Sit Rd. Azusa New Yo...         3     32   
4  Calista Wise 7292 Dictum Av. San Antonio MI 47...         5     45   

         state  
0  Mississippi  
1     Nebraska  
2          NaN  
3     New York  
4          NaN  

df = df1.merge(df2, on='state', how='left')
print (df)
                                             address  quantity  price  \
0  Cecilia Chapman 711-2880 Nulla St. Mankato Mis...         2     20   
1  Iris Watson P.O. Box 283 8562 Fusce Rd. Freder...         3     13   
2  Celeste Slater 606-3727 Ullamcorper. Street Ro...         5     23   
3  Theodore Lowe Ap #867-859 Sit Rd. Azusa New Yo...         3     32   
4  Calista Wise 7292 Dictum Av. San Antonio MI 47...         5     45   

         state   id  
0  Mississippi  1.0  
1     Nebraska  2.0  
2          NaN  NaN  
3     New York  3.0  
4          NaN  NaN  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you, but I don't understand two `parenthese` in `str.extract('('+ ... + ')'`, could you explain more? – ah bon Feb 06 '20 at 10:10
  • 1
    @ahbon - it is because matching regex pattern need `(regex)`, so added `()` to `pat` – jezrael Feb 06 '20 at 10:12
  • Sorry, I get `ValueError: Wrong number of items passed 50, placement implies 1` with real data, any ideas? – ah bon Feb 07 '20 at 02:24
  • @ahbon Is possible some non alphabet or space values in `df2['state']`? – jezrael Feb 07 '20 at 04:58
  • 1
    Yes, I think so. – ah bon Feb 07 '20 at 04:59
  • 1
    @ahbon - Is possible test if change `pat = '|'.join(r"\b{}\b".format(x) for x in df2['state'])` to `import re` and `pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['state'])` ? – jezrael Feb 07 '20 at 06:05
  • If the datafame is in Chinese, it will not work out. Please check https://stackoverflow.com/questions/60219423/merge-if-one-column-contained-in-another-column-in-python. – ah bon Feb 14 '20 at 03:14
  • @ahbon - Unfortunately I have no experience with chinese text processing, so cannot halp with it. – jezrael Feb 14 '20 at 06:19
  • 1
    No problem. Thank you. :) But I think the logic should be same, except English characters have more space to split words. – ah bon Feb 14 '20 at 07:40