1

Let's say I have a combined dataframe named df as follows. Each row has buildings' info and their matched buildings' info. I hope to merge id of each building from df1, df2 and df3 (see below). The columns of df_num or matched_df_num is there to distingue which dataframe the building info come from, if it's equals to 1, means it's from df1, 2 means from df2, 3 means from df3.

   df_num       city                             name  matched_df_num  \
0      1   Shenzhen                      Kingkey 100              2   
1      2   Shenzhen           Ping An Finance Centre              2   
2      2   Shenzhen           Ping An Finance Centre              3   
3      2  Guangzhou     Guangzhou CTF Finance Centre              3   
4      3   Shanghai  Shanghai World Financial Center              3   

  matched_city                     matched_name  similarity_ratio  
0     Shenzhen           Ping An Finance Centre                51  
1    Guangzhou     Guangzhou CTF Finance Centre                66  
2     Shanghai  Shanghai World Financial Center                59  
3     Shanghai  Shanghai World Financial Center                56  
4     Changsha            Changsha IFS Tower T1                57  

I want to merge the column of ids from df1, df2 and df3 below for building names and matched names:

df1 = pd.DataFrame(np.array([
    [1010667747, 'Suzhou', 'Suzhou IFS'],
    [1010667356, 'Shenzhen', 'Kingkey 100'],
    [1010667289, 'Wuhan', 'Wuhan Center']]),
    columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
    [190010, 'Shenzhen', 'Ping An Finance Centre'],
    [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
    [190015, 'Beijing', 'China Zun']]),
    columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
    ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
    ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
    ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
    columns=['id', 'city', 'name']
)

This is my expected result:

   df_num       city                             name          id  \
0      1   Shenzhen                      Kingkey 100  1010667356   
1      2   Shenzhen           Ping An Finance Centre      190010   
2      2   Shenzhen           Ping An Finance Centre      190010   
3      2  Guangzhou     Guangzhou CTF Finance Centre      190012   
4      3   Shanghai  Shanghai World Financial Center       ZY-13   

   matched_df_num  matched_city                     matched_name  \
0              2     Shenzhen           Ping An Finance Centre   
1              2    Guangzhou     Guangzhou CTF Finance Centre   
2              3     Shanghai  Shanghai World Financial Center   
3              3     Shanghai  Shanghai World Financial Center   
4              3     Changsha            Changsha IFS Tower T1   

   similarity_ratio matched_id  
0                51     190010  
1                66     190010  
2                59      ZY-13  
3                56      ZY-13  
4                57      ZY-16  

How could I insert two new columns id and matched_id and their values in df using Pandas? Thanks for helps at advance.

Update: my solution:

df = df.merge(df1, on = ['city', 'name'], how = 'left').merge(df2, on = ['city', 'name'], how = 'left').merge(df3, on = ['city', 'name'], how = 'left')
final_df = df.merge(df1, left_on = ['matched_city', 'matched_name'], right_on = ['city', 'name'], how = 'left').merge(df2, left_on = ['matched_city', 'matched_name'], right_on = ['city', 'name'], how = 'left').merge(df3, left_on = ['matched_city', 'matched_name'], right_on = ['city', 'name'], how = 'left')

   df_num     city_x                           name_x  matched_df_num  \
0       1   Shenzhen                      Kingkey 100               2   
1       2   Shenzhen           Ping An Finance Centre               2   
2       2   Shenzhen           Ping An Finance Centre               3   
3       2  Guangzhou     Guangzhou CTF Finance Centre               3   
4       3   Shanghai  Shanghai World Financial Center               3   

  matched_city                     matched_name  similarity_ratio        id_x  \
0     Shenzhen           Ping An Finance Centre                51  1010667356   
1    Guangzhou     Guangzhou CTF Finance Centre                66         NaN   
2     Shanghai  Shanghai World Financial Center                59         NaN   
3     Shanghai  Shanghai World Financial Center                56         NaN   
4     Changsha            Changsha IFS Tower T1                57         NaN   

     id_y   id_x id_y city_y name_y    id_x     city_x  \
0     NaN    NaN  NaN    NaN    NaN  190010   Shenzhen   
1  190010    NaN  NaN    NaN    NaN  190012  Guangzhou   
2  190010    NaN  NaN    NaN    NaN     NaN        NaN   
3  190012    NaN  NaN    NaN    NaN     NaN        NaN   
4     NaN  ZY-13  NaN    NaN    NaN     NaN        NaN   

                         name_x   id_y    city_y  \
0        Ping An Finance Centre    NaN       NaN   
1  Guangzhou CTF Finance Centre    NaN       NaN   
2                           NaN  ZY-13  Shanghai   
3                           NaN  ZY-13  Shanghai   
4                           NaN  ZY-16  Changsha   

                            name_y  
0                              NaN  
1                              NaN  
2  Shanghai World Financial Center  
3  Shanghai World Financial Center  
4            Changsha IFS Tower T1  
ah bon
  • 9,293
  • 12
  • 65
  • 148
  • Have you taken a look at the "merging multiple DataFrames" section in https://stackoverflow.com/questions/53645882/pandas-merging-101? – cs95 Mar 27 '19 at 05:34
  • Thanks. I update my solution in question, which is not as expected result, could help me to improve it? Thanks. – ah bon Mar 27 '19 at 05:49
  • What is wrong with it? – cs95 Mar 27 '19 at 05:55
  • Too much columns with `_x` and `_y`, I want only two more columns for `df`, `id` and `matched_id`. `id` is for first three columns, and `matched_id` is for last three columns. – ah bon Mar 27 '19 at 05:57
  • "matched_id" is not in any of your original dataframes, where does it come from? – cs95 Mar 27 '19 at 05:59
  • Both `id` and `matched_id` are merged from `df1`, `df2` and `df3`. `id` is for columns: `'df_num', 'city', 'name'`; while `mached_id` is for: `'matched_df_num', 'matched_city', 'matched_name'`. – ah bon Mar 27 '19 at 06:03
  • This is confusing because it isn't shown in your sample data. – cs95 Mar 27 '19 at 06:04
  • Yeah, in fact, columns names in `df` startwith `matched_` are from df1, df2 and df3 as well, it's used only for purpose of distingue with first tree column names as `_x` or `_y`. – ah bon Mar 27 '19 at 06:09

2 Answers2

1

You can use concat with merge and left join:

dff = pd.concat([df1, df2, df3])
print (dff)
           id       city                             name
0  1010667747     Suzhou                       Suzhou IFS
1  1010667356   Shenzhen                      Kingkey 100
2  1010667289      Wuhan                     Wuhan Center
0      190010   Shenzhen           Ping An Finance Centre
1      190012  Guangzhou     Guangzhou CTF Finance Centre
2      190015    Beijing                        China Zun
0       ZY-13   Shanghai  Shanghai World Financial Center
1       ZY-15  Hong Kong    International Commerce Centre
2       ZY-16   Changsha            Changsha IFS Tower T1

df = df.merge(dff,on = ['city', 'name'], how = 'left')
print (df)
   df_num       city                             name  matched_df_num  \
0       1   Shenzhen                      Kingkey 100               2   
1       2   Shenzhen           Ping An Finance Centre               2   
2       2   Shenzhen           Ping An Finance Centre               3   
3       2  Guangzhou     Guangzhou CTF Finance Centre               3   
4       3   Shanghai  Shanghai World Financial Center               3   

  matched_city                     matched_name  similarity_ratio          id  
0     Shenzhen           Ping An Finance Centre                51  1010667356  
1    Guangzhou     Guangzhou CTF Finance Centre                66      190010  
2     Shanghai  Shanghai World Financial Center                59      190010  
3     Shanghai  Shanghai World Financial Center                56      190012  
4     Changsha            Changsha IFS Tower T1                57       ZY-13

Then merge again, for avoid duplicated columns use rename:

d = {'city':'matched_city','name':'matched_name', 'id':'matched_id'}
df5 = df.merge(dff.rename(columns=d),on = ['matched_city', 'matched_name'], how = 'left')
print (df5)
   df_num       city                             name  matched_df_num  \
0       1   Shenzhen                      Kingkey 100               2   
1       2   Shenzhen           Ping An Finance Centre               2   
2       2   Shenzhen           Ping An Finance Centre               3   
3       2  Guangzhou     Guangzhou CTF Finance Centre               3   
4       3   Shanghai  Shanghai World Financial Center               3   

  matched_city                     matched_name  similarity_ratio          id  \
0     Shenzhen           Ping An Finance Centre                51  1010667356   
1    Guangzhou     Guangzhou CTF Finance Centre                66      190010   
2     Shanghai  Shanghai World Financial Center                59      190010   
3     Shanghai  Shanghai World Financial Center                56      190012   
4     Changsha            Changsha IFS Tower T1                57       ZY-13   

  matched_id  
0     190010  
1     190012  
2      ZY-13  
3      ZY-13  
4      ZY-16   

EDIT: You can add new values to each DataFrame by DataFrame.assign first, and then merge also by this column:

dff = pd.concat([df1.assign(df_num=1), df2.assign(df_num=2), df3.assign(df_num=3)])
df = df.merge(dff,on = ['city', 'name','df_num'], how = 'left')

d = {'city':'matched_city','name':'matched_name', 'id':'matched_id','df_num':'matched_df_num'}
df5 = (df.merge(dff.rename(columns=d), 
                on = ['matched_city', 'matched_name','matched_df_num'], 
                how = 'left'))
print (df5)
   df_num       city                             name  matched_df_num  \
0       1   Shenzhen                      Kingkey 100               2   
1       2   Shenzhen           Ping An Finance Centre               2   
2       2   Shenzhen           Ping An Finance Centre               3   
3       2  Guangzhou     Guangzhou CTF Finance Centre               3   
4       3   Shanghai  Shanghai World Financial Center               3   

  matched_city                     matched_name  similarity_ratio          id  \
0     Shenzhen           Ping An Finance Centre                51  1010667356   
1    Guangzhou     Guangzhou CTF Finance Centre                66      190010   
2     Shanghai  Shanghai World Financial Center                59      190010   
3     Shanghai  Shanghai World Financial Center                56      190012   
4     Changsha            Changsha IFS Tower T1                57       ZY-13   

  matched_id  
0     190010  
1     190012  
2      ZY-13  
3      ZY-13  
4      ZY-16  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks a lot. Can I get one more column named `matche_id` for building of `matched_df_num, matched_city, matched_name`. Please note for each row, there are two buildings' info: building and its matches, so I want get all their `id`s but seperate to `id` and `matched_id`. – ah bon Mar 27 '19 at 06:37
  • Please check expected result part in my question. :) – ah bon Mar 27 '19 at 06:39
  • One more comment, I think maybe `if condtion` is necessy here, `if df_num == 1`, then `df` should merge with `df1` based in cities and names, `if df_num == 2`, `df` should merge with `df2`, etc. Otherwise, merged `id` will not be accurate at some cases, if there some building info are same in `df1`, `df2` and `df3`. – ah bon Mar 27 '19 at 06:49
  • @ahbon - Edited answer. – jezrael Mar 27 '19 at 07:01
  • 1
    Cool and perfect. Thanks a lot. – ah bon Mar 27 '19 at 07:09
0

Try this, it may help you to solve your problem

    df1 = pd.DataFrame(np.array([
[1010667747, 'Suzhou', 'Suzhou IFS'],
[1010667356, 'Shenzhen', 'Kingkey 100'],
[1010667289, 'Wuhan', 'Wuhan Center']]),
columns=['id', 'city', 'name']
)
df2 = pd.DataFrame(np.array([
    [190010, 'Shenzhen', 'Ping An Finance Centre'],
    [190012, 'Guangzhou', 'Guangzhou CTF Finance Centre'],
    [190015, 'Beijing', 'China Zun']]),
    columns=['id', 'city', 'name']
)
df3 = pd.DataFrame(np.array([
    ['ZY-13', 'Shanghai', 'Shanghai World Financial Center'],
    ['ZY-15', 'Hong Kong', 'International Commerce Centre'],
    ['ZY-16', 'Changsha', 'Changsha IFS Tower T1']]),
    columns=['id', 'city', 'name']
)

df1['df_type'] = 1
df2['df_type'] = 2
df3['df_type'] = 3

df = pd.concat([df1,df2,df3])

df
Himmat
  • 166
  • 5