The problem isn't the lengths, it's the OFFERING_ID. In short, OFFERING_ID isn't unique in the second dataframe. So you get more than one match per OFFERING_ID, and thus more rows than the original.
I made an example in repl.it; the code is also pasted below:
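import pandas as pd

# Left dataframe: every OFFERING_ID appears exactly once
df1 = pd.DataFrame(
    [
        {"OFFERING_ID": 1, "another_field": "whatever"},
        {"OFFERING_ID": 2, "another_field": "whatever"},
        {"OFFERING_ID": 3, "another_field": "whatever"},
        {"OFFERING_ID": 4, "another_field": "whatever"},
    ]
)

# Right dataframe: the integer key 1 appears twice
# (the string "1" in the first row is a different value and does not match it)
df2 = pd.DataFrame(
    [
        {"OFFERING_ID": "1", "another_field": "whatever"},
        {"OFFERING_ID": 1, "another_field": "whatever"},
        {"OFFERING_ID": 1, "another_field": "whatever"},
    ]
)

print(df1.shape)
print(df2.shape)
# The left join has more rows than df1, because key 1 matches more than one row of df2
print(pd.merge(df1, df2, on="OFFERING_ID", how="left").shape)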
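
To check this on your own data, something along these lines might help. It is only a sketch: the column names left_field and right_field and the sample values are made up for illustration, and keeping the first row per key is just one possible fix (depending on your data you may want to aggregate the duplicates instead).

```python
import pandas as pd

# Hypothetical sample data; only the OFFERING_ID column name comes from the question
df1 = pd.DataFrame({"OFFERING_ID": [1, 2, 3, 4], "left_field": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"OFFERING_ID": [1, 1, 2], "right_field": ["x", "y", "z"]})

# Is the key unique in the right-hand dataframe?
right_key_is_unique = df2["OFFERING_ID"].is_unique

# Which keys occur more than once, and how often?
key_counts = df2["OFFERING_ID"].value_counts()
duplicated_keys = key_counts[key_counts > 1]

# validate="many_to_one" makes pandas raise instead of silently adding rows
try:
    pd.merge(df1, df2, on="OFFERING_ID", how="left", validate="many_to_one")
    merge_error = None
except pd.errors.MergeError as exc:
    merge_error = exc

# One possible fix: keep a single row per key before merging
df2_unique = df2.drop_duplicates(subset="OFFERING_ID", keep="first")
merged = pd.merge(df1, df2_unique, on="OFFERING_ID", how="left", validate="many_to_one")

print(right_key_is_unique)
print(duplicated_keys)
print(merged.shape)
```

With the duplicates removed, the left join has the same number of rows as df1 again.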