3

I am in my first few weeks of learning pandas and need help with a problem I am stuck with. I have 2 dataframes as listed below:

df1 = pd.DataFrame({
    'City': ['Chicago','Atlanta', 'Dallas', 'Atlanta', 'Chicago', 'Boston', 'Dallas','El Paso','Atlanta'],
    'State': ['IL','GA','TX','GA','IL','MA','TX','TX','GA'],
    'Population': [8865000,523738,6301000,523738,8865000,4309000,6301000,951000,523738]
}, columns=['City', 'State', 'Population'])

df1

    City    State   Population
0   Chicago IL     8865000
1   Atlanta GA     523738
2   Dallas  TX     6301000
3   Atlanta GA     523738
4   Chicago IL     8865000
5   Boston  MA     4309000
6   Dallas  TX     6301000
7   El Paso TX     951000
8   Atlanta GA     523738

.

df2 = pd.DataFrame({
    'Airport': ['Hartsfield','Logan','O Hare','DFW'],
    'M_Code': [78,26,52,39]
},index=[
    'Atlanta',
    'Boston',
    'Chicago',
    'Dallas'])


df2

          Airport        M_Code
Atlanta   Hartsfield     78
Boston    Logan          26
Chicago   O Hare         52
Dallas    DFW            39

Expected output is:

df1

    City    State   Population  M_Code  City_indexed_in_df2
0   Chicago IL      8865000     52      True
1   Atlanta GA      523738      78      True
2   Dallas  TX      6301000     39      True
3   Atlanta GA      523738      78      True
4   Chicago IL      8865000     52      True
5   Boston  MA      4309000     26      True
6   Dallas  TX      6301000     39      True
7   El Paso TX      951000      NaN     False
8   Atlanta GA      523738      78      True

I started with:

df1.loc[df1.City.isin(df2.index),:]

    City    State   Population
0   Chicago IL  8865000
1   Atlanta GA  523738
2   Dallas  TX  6301000
3   Atlanta GA  523738
4   Chicago IL  8865000
5   Boston  MA  4309000
6   Dallas  TX  6301000
8   Atlanta GA  523738

As expected this filters out the row with El Paso. But am not able to come up with code to do this --> For every df1.City I need to lookup on df2.index and if found:

  1. Extract df2.M_Code and insert the value to a new column df1.M_Code
  2. Insert boolean result to a new column df1.City_indexed_in_df2

Can someone help me with how I can acheive this. In addition, my thought is that creating a unique array from df1.City and then doing a lookup on df2.index may get improved performance(being a novice, I haven't figured out how to do this beyond extracting the unique array below.)

arr = df1.City.unique()

array(['Chicago', 'Atlanta', 'Dallas', 'Boston', 'El Paso'], dtype=object)

Suggestions on changing the solution approach will be great too.

  • 2
    just left merge , then assign a boolean colum: `df1.merge(df2['M_Code'],left_on='City',right_index=True,how='left').assign(City_indexed_in_df2=lambda x:x['M_Code'].notnull())` , refer this: https://stackoverflow.com/questions/53645882/pandas-merging-101 – anky Jun 21 '20 at 14:56
  • Yes. It does. Thank you Anky and Ch3steR – Raghavendra Channappa Jun 21 '20 at 15:30

1 Answers1

3

You can do this, merge with how='left' and then create a new column using notna():

df = df1.merge(df2, left_on=['City'], right_index=True, how='left')
df['City_indexed_in_df2'] = df['M_Code'].notna()
print(df)

      City State  Population     Airport  M_Code  City_indexed_in_df2
0  Chicago    IL     8865000      O Hare    52.0                 True
1  Atlanta    GA      523738  Hartsfield    78.0                 True
2   Dallas    TX     6301000         DFW    39.0                 True
3  Atlanta    GA      523738  Hartsfield    78.0                 True
4  Chicago    IL     8865000      O Hare    52.0                 True
5   Boston    MA     4309000       Logan    26.0                 True
6   Dallas    TX     6301000         DFW    39.0                 True
7  El Paso    TX      951000         NaN     NaN                False
8  Atlanta    GA      523738  Hartsfield    78.0                 True
NYC Coder
  • 7,424
  • 2
  • 11
  • 24