1

I am trying to follow along with some already-written code to see how it works. At one point, the author uses .join() to unify two dfs as a relational database, so I am working with .join(), not merge, to see how it works. As I understand it, join uses merge anyway. I have read through the large Pandas Merging 101.

The .join() documentation says the on parameter accepts index or column name. I have not been able to get the column name version to work.

MWE:

I define two df's with a common column name, but not in the same order so a straight index join will not organize these correctly. Trying to use .join with on='name' results in a ValueError. I suspect it's trying to match the index of one to the name of another. I have to use set_index() to turn the column name into the index before it will work.

So, why is .join() insisting on using index all the time, if the documentation says I can provide a column name instead?

import pandas as pd

data1 = {
  "name": ["Sally", "Mary", "John"],
  "age": [50, 40, 30]
}

data2 = {
  "contacted": [True, False, False],
  "name": ["John", "Mary", "Sally"]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

#newdf = df1.join(df2, on='name', lsuffix='_L', rsuffix='_R') #doesn't work "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

newdf = df1.join(df2.set_index('name'), on='name', lsuffix='_L', rsuffix='_R')

print(newdf)
eschares
  • 73
  • 7
  • 1
    The documentation seems consistent with the behavior. For example, for `on` it says *Column or index level name(s) in the caller to join on the index in other* – topsail Jan 24 '23 at 16:47
  • Thanks, that's not how I initially read it but you are right. So `on` is referring to the column name in df1? And df2 is always index-level? – eschares Jan 24 '23 at 16:50
  • Yes that's my take too. I wasn't fully aware of this myself - good question :) – topsail Jan 24 '23 at 16:52
  • 2
    Yup, `df1.join(df2)` always merges via the index of df2 whereas `df1.merge(df2)` will merge on the column. [Here](https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas) is a better answer. – bajun65537 Jan 24 '23 at 16:56

1 Answers1

0

You can use pd.merge and do a left join on column names of the two dataframes, see below as an example on your code:

newdf= pd.merge(df1, df2, how='left', left_on='name', right_on='name')

This will work even if the column names are different in two dataframes.

  • Yes, I would have preferred they use `merge` which seems more flexible. The code was written using `join` so I wanted to investigate that specifically. – eschares Jan 24 '23 at 16:51