Match strings between two dataframes and create column

Question

I am trying to match parts of string from bad_boy to good_boy and create a column in the original df (bad_boy) called the Right Address but having hard time getting this accomplished. I have looked at the links below:

Replace whole string if it contains substring in pandas

Return DataFrame item using partial string match on rows pandas python

import pandas as pd
bad_boy = pd.read_excel('C:/Users/Programming/.xlsx')
df = pd.DataFrame(bad_boy)

print (df['Address'].head(3))

0  1234 Stack Overflow
1  7458 Python
2  8745 Pandas

good_boy = pd.read_excel('C:/Users/Programming/.xlsx')

df2 = pd.DataFrame(good_boy)

print (df2['Address'].head(10))

0 5896 Java Road
1 1234 Stack Overflow Way
2 7459 Ruby Drive
3 4517 Numpy Creek Way
4 1642 Scipy Trail
5 7458 Python Avenue
6 8745 Pandas Lane
7 9658 Excel Road
8 7255 Html Drive
9 7459 Selenium Creek Way

I tried this:

df['Right Address'] = df.loc[df['Address'].str.contains('Address', case = False, na = False, regex = False), df2['Address']]

but this throws out an error:

'None of [0.....all addresses\nName: Address, dtype: object] are in the [columns]'

Result being requested:

print (df['Right Address'].head(3))

0  1234 Stack Overflow Way
1  7458 Python Avenue
2  8745 Pandas Lane

your numbers column 1234, 7458 and 8745 all match in your two dataframes. can you just join on that and keep the df2 names? that would give your desired result. or do you need to do this by string matching? — Max Power, May 03 '17 at 17:25

Vaishali · Accepted Answer · 2017-05-03T20:02:34.277

4

You can use merge combined with str.extract for partial match

df1 = df1.merge(df2, left_on = df1.Address.str.extract('(\d+)', expand = False), right_on = df2.Address.str.extract('(\d+)', expand = False), how = 'inner').rename(columns = {'Address_y': 'Right_Address'})

You get

    Address_x           Right_Address
0   1234 Stack Overflow 1234 Stack Overflow Way
1   7458 Python         7458 Python Avenue
2   8745 Pandas         8745 Pandas Lane

edited May 03 '17 at 20:02

answered May 03 '17 at 18:26

Vaishali

37,545
5
58
86

Thanks, when I write `df1.to_excel`, the `Right_Address` does not show up. `print (df1.columns)` returns `Index(['Project', 'Order Date', 'Paid Date', 'Resale Released', 'Estimated Close Date', 'Estimated Sales Price', 'Address', 'Title Company', 'Title Company Email', 'Seller', 'Builder/HO', 'Actual Close Date', 'Actual Sales Price', 'Status of Assessments', 'Closing Received', 'Unnamed: 15', 'Unnamed: 16'], dtype='object')`. The Right_Address is not there. – Jacob_Cortese May 03 '17 at 19:49
2

did you assign by the merge to the df1 by doing df1 = df1.merge...? – Vaishali May 03 '17 at 19:52
That took care of the Address, but `FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)`, I tried doing `df1.Address.str.extract('(\d+)'expand = False), `...did not work. – Jacob_Cortese May 03 '17 at 19:57
2

Expand = False is taking care of the warning by the way. I have edited the answer – Vaishali May 03 '17 at 20:02
@Vaishali. Hi! I am trying to apply this to my dataset but seem to be having a hard time getting an issue. For some reason, when I apply it to my dataset, it seems to give me complete gibberish. Is this code supposed to handle the following? Say my one dataframe has one column with value "hello this", other dataframe has values "hello this is me" and "hello this is me???", what I want to see in the output is one column have "hello this" the other column has the associated 2 values joined to the "hello this". Is your code supposed to handle that? Not sure why its not working for me. – bernando_vialli Nov 28 '17 at 15:45
@mkheifetz, can you post a sample of your dataframe? – Vaishali Nov 28 '17 at 18:17
1

@ and no, this solution wont work in your example as the solution is matching based on numbers like 1234, 7458 whereas in your case, you need to match strings – Vaishali Nov 28 '17 at 18:19
@Vaishali, thank you for your response. So how can I adopt it to my text? Is it just a matter of changing what goes inside the extract? – bernando_vialli Nov 28 '17 at 18:22
Dataframe 1: column Q2: (row1) "Lot number on result was not the". (row2) "Your web site has quite possible the worst" Dataframe 2: column Q2: (row1) "The Lot number on result was not the biggest problem there" (row2) "hello my name is Alex". In this small example, there is a partial match between the two columns for row 1 as Dataframe 1 row 1 is a subset of Dataframe 2 row 1, so the output I would like to see is both of those two merged and showing in 2 columns next to each other... – bernando_vialli Nov 28 '17 at 18:28
Is this solved? how we can iterate through string columns to get the closest/similar string with another list? – pRo Mar 11 '22 at 15:19

Match strings between two dataframes and create column

1 Answers1