I am trying to follow along with some already-written code to see how it works. At one point, the author uses .join()
to unify two dfs as a relational database, so I am working with .join()
, not merge
, to see how it works. As I understand it, join
uses merge
anyway. I have read through the large Pandas Merging 101.
The .join()
documentation says the on
parameter accepts index or column name. I have not been able to get the column name version to work.
MWE:
I define two df's with a common column name
, but not in the same order so a straight index join will not organize these correctly. Trying to use .join
with on='name'
results in a ValueError. I suspect it's trying to match the index of one to the name
of another. I have to use set_index()
to turn the column name
into the index before it will work.
So, why is .join()
insisting on using index all the time, if the documentation says I can provide a column name instead?
import pandas as pd
data1 = {
"name": ["Sally", "Mary", "John"],
"age": [50, 40, 30]
}
data2 = {
"contacted": [True, False, False],
"name": ["John", "Mary", "Sally"]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
#newdf = df1.join(df2, on='name', lsuffix='_L', rsuffix='_R') #doesn't work "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
newdf = df1.join(df2.set_index('name'), on='name', lsuffix='_L', rsuffix='_R')
print(newdf)