JOIN two dataframes on common column in python

Question

I have a dataframe df:

id   name   count
1    a       10
2    b       20
3    c       30
4    d       40
5    e       50

Here I have another dataframe df2:

id1  price   rating
 1     100     1.0
 2     200     2.0
 3     300     3.0
 5     500     5.0

I want to join these two dataframes on column id and id1(both refer same). Here is an example of df3:

id   name   count   price   rating
1    a       10      100      1.0
2    b       20      200      2.0
3    c       30      300      3.0
4    d       40      Nan      Nan
5    e       50      500      5.0

Should I use df.merge or pd.concat?

@piRSquared Sir, both answers are absolutely correct :) Can't pick both, just one question, suppose i am dealing with two dataframes each of around 4 million rows. i want the fastest way, in between `join,merge and map` Which one should be the most optimized way — Shubham R, Jan 04 '17 at 12:21
Both are essentially the same under the hood. And I don' t care which you pick. @jezrael and I are always on SO. We'll get our rep one way or another. I'm more interested in making sure those 15 rep don't go to waste. Pick his because he was a few micro seconds faster than me on this one ;-) — piRSquared, Jan 04 '17 at 12:23

jezrael · Accepted Answer · 2017-01-04T12:10:18.367

78

Use merge:

print (pd.merge(df1, df2, left_on='id', right_on='id1', how='left').drop('id1', axis=1))
   id name  count  price  rating
0   1    a     10  100.0     1.0
1   2    b     20  200.0     2.0
2   3    c     30  300.0     3.0
3   4    d     40    NaN     NaN
4   5    e     50  500.0     5.0

Another solution is simple rename column:

print (pd.merge(df1, df2.rename(columns={'id1':'id'}), on='id',  how='left'))
   id name  count  price  rating
0   1    a     10  100.0     1.0
1   2    b     20  200.0     2.0
2   3    c     30  300.0     3.0
3   4    d     40    NaN     NaN
4   5    e     50  500.0     5.0

If need only column price the simpliest is map:

df1['price'] = df1.id.map(df2.set_index('id1')['price'])
print (df1)
   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

Another 2 solutions:

print (pd.merge(df1, df2, left_on='id', right_on='id1', how='left')
         .drop(['id1', 'rating'], axis=1))
   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

print (pd.merge(df1, df2[['id1','price']], left_on='id', right_on='id1', how='left')
         .drop('id1', axis=1))
   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

edited Jan 04 '17 at 12:10

answered Jan 04 '17 at 11:51

jezrael

822,522
95
1,334
1,252

keeping this answer, if from df2 i have to select only 1 column 'price' then? – Shubham R Jan 04 '17 at 11:53
I dont understand, can you explain more? – jezrael Jan 04 '17 at 11:55
final table has `id name count id1 price rating` but i want only `price` from df2 not `rating`, then? – Shubham R Jan 04 '17 at 11:56
the 2 ways that you suggested both are correct,right? – Shubham R Jan 04 '17 at 12:00
Yes, all solutions are correct. If need add more column, nicer and better is `join` (not necessary delete column, left join by default), but if need add only one column `map` is faster. – jezrael Jan 04 '17 at 12:05
for two huge dataframe(say each df has around 4 million rows, should i use merge or map? which one would take less time to complete – Shubham R Jan 04 '17 at 12:08
do you need add one or 2 columns? – jezrael Jan 04 '17 at 12:09
just a question, my requirements may change, sometimes one column sometimes 2 columns. which one would be best in both the cases – Shubham R Jan 04 '17 at 12:10
hmmm, I try test it, but `join` or `merge` is more universal. – jezrael Jan 04 '17 at 12:12
I try test if `join` or `merge` is faster. – jezrael Jan 04 '17 at 12:12

piRSquared · Answer 2 · 2017-01-04T12:15:03.520

join utilizes the index to merge on unless we specify a column to use instead. However, we can only specify a column instead of the index for the 'left' dataframe.

Strategy:

set_index on df2 to be id1
use join with df as the left dataframe and id as the on parameter. Note that I could have set_index('id') on df to avoid having to use the on parameter. However, this allowed me leave the column in the dataframe rather than having to reset_index later.

df.join(df2.set_index('id1'), on='id')

   id name  count  price  rating
0   1    a     10  100.0     1.0
1   2    b     20  200.0     2.0
2   3    c     30  300.0     3.0
3   4    d     40    NaN     NaN
4   5    e     50  500.0     5.0

If you only want price from df2

df.join(df2.set_index('id1')[['price']], on='id')


   id name  count  price
0   1    a     10  100.0
1   2    b     20  200.0
2   3    c     30  300.0
3   4    d     40    NaN
4   5    e     50  500.0

keeping this answer, if from df2 i have to select only 1 column 'price' then? — Shubham R, Jan 04 '17 at 11:54

JOIN two dataframes on common column in python

2 Answers2

Linked

Related