2

I have two dataframes that I'm trying to join in pandas (version 0.18.1).

test1 = pd.DataFrame({'id': range(1,6), 'place': ['Kent','Lenawee','Washtenaw','Berrien','Ottawa']})

  id_1      place
0   1       Kent
1   2    Lenawee
2   3   Montreal
3   4    Berrien
4   5     Ottawa


test2 = pd.DataFrame({'id_2': range(6,11), 'id_parent': range(1,6)})

   id_2  id_parent
0     6          1
1     7          2
2     8          3
3     9          4
4    10          5

Yet when I join the two tables, the last row doesn't join properly and, because it's a left join, results in NaN.

df = test2.join(test1,on='id_parent',how='left')

   id_2  id_parent  id_1      place
0     6          1     2    Lenawee
1     7          2     3   Montreal
2     8          3     4    Berrien
3     9          4     5     Ottawa
4    10          5   NaN        NaN

This doesn't make sense to me-- id_parent and id_1 are the keys on which to join the two tables, and they both have the same value. Both columns have the same dtype (int64). What's going on here?

ale19
  • 1,327
  • 7
  • 23
  • 38

2 Answers2

4

join joins primarily on indices, use merge for this:

In [18]:
test2.merge(test1,left_on='id_parent', right_on='id')

Out[18]:
   id_2  id_parent  id      place
0     6          1   1       Kent
1     7          2   2    Lenawee
2     8          3   3  Washtenaw
3     9          4   4    Berrien
4    10          5   5     Ottawa

You get the NaN because the rhs will use the rhs index and there is no entry for 0 and 5 so you get NaN

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Ah...I assumed I should use .join() since in SQL I think this would have been a join. Oops. Thank you! – ale19 May 25 '16 at 14:31
2

Here I quote the documentation of pandas : 'join takes an optional on argument which may be a column or multiple column names, which specifies that the passed DataFrame is to be aligned on that column in the DataFrame. "

So in your case, you are matching the index of test2 on id_parent from test1.

LouisBBBB
  • 159
  • 4