4

I have two dataframes df1 and df2 as shown below:-

df1 = pd.DataFrame({'x': [1, '3', 5,'t','m','u'],'y':[2, 4, 6, 4, 4, 8]})

df2 = pd.DataFrame({'x': [1, 3, '4','t'],'z':[2, 4, 6,7]})

I am trying to merge(left join) the two data frames as:-

df=pd.merge(df1, df2, how='left', on='x')

the output is:-

df
Out[25]: 
   x  y    z
0  1  2  2.0
1  3  4  NaN
2  5  6  NaN
3  t  4  7.0
4  m  4  NaN
5  u  8  NaN

Clearly for second row above i.e for x=3, I would like to have z=4 instead of NaN.Is there an option to define data type of the key during merge or any other workaround where I can change the dtype of the keys to string in both data frames and get the desired output.

itthrill
  • 1,241
  • 2
  • 17
  • 36

2 Answers2

10

You can use assign to temporarily assign new dtype to the x column:

pd.merge(df1.assign(x=df1.x.astype(str)), 
         df2.assign(x=df2.x.astype(str)), 
         how='left', on='x')

Output:

   x  y    z
0  1  2  2.0
1  3  4  4.0
2  5  6  NaN
3  t  4  7.0
4  m  4  NaN
5  u  8  NaN
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thanks Scott, it worked and is exactly what I needed i,e to change dtype during merge step. Can you kindly let me know how can I use your code if col name instead of 'x' is let's say 'my id'. Apology if it's silly question. – itthrill Dec 06 '17 at 16:39
  • You can not use the assign with column names that contain special characters or a space. You'll need to either create a intermediate dataframe that that dtype change on 'my id', or add a new column to existing datafram with 'my id' casted to string, or change the name of the column from 'my id' to 'my_id' and use assign. @Wen method should still work if you wish to cast all columns in dataframe as string. – Scott Boston Dec 06 '17 at 16:42
  • My best recommendation is to change the column name or create a new column such that your space is an underscore like 'my_id' or 'myid' and this will work. – Scott Boston Dec 06 '17 at 17:14
  • I agree and I have done exactly the same. It works well. Additionally one of the dataframes was a value to a key of dictionary containing dataframes. Earlier when I was trying to use astype(str), it was throwing a warning like SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. With you suggestion, the issue is resolved. – itthrill Dec 06 '17 at 17:16
  • Setting with copy is a hard issue to get handle over, the easiest way to eliminate that warning is to use .copy() when creating the new dataframe. Basically this waring is letting you know that any changes made on the newly created dataframe are not reflected in the original dataframe. And, like 99% of time you don't want this functionality anway. So, you can tell pandas that your newly created dataframe is seperate by using .copy(). – Scott Boston Dec 06 '17 at 17:19
  • See this [SO Post on SettingWtihCopyWarning](https://stackoverflow.com/q/20625582/6361531) – Scott Boston Dec 06 '17 at 17:22
  • 1
    Thanks Scott. Very helpful indeed. – itthrill Dec 06 '17 at 17:26
  • This is much better than blanket-casting the entire thing to string. +1 – cs95 Dec 11 '18 at 06:40
3

Your df1 and df2 ,have different dtype for 3 one is numeric another is str, so we convert them all to string they can get match

df=pd.merge(df1.astype(str), df2.astype(str), how='left', on='x')
df
Out[914]: 
   x  y    z
0  1  2    2
1  3  4    4
2  5  6  NaN
3  t  4    7
4  m  4  NaN
5  u  8  NaN
BENY
  • 317,841
  • 20
  • 164
  • 234