I am looking to merge 2 dataframes in python, df1 and df2, on 2 columns, Site and Building, with different row quantities as a way of attaining a "safe" value for each generator value in df1. Below is a demonstration code, though I've created the dataframes in the below example (which appears to work), the data for each table in the actual issue comes from SQL queries, which leads me to believe the merge is having issues due to data type.
import pandas as pd
df = {'Site': ['Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Belgium','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Holland','Greece','Greece','Greece','Greece','Greece','Greece'],
'Building' : ['X1','X1','X1','X1','X1','X1','X2','X2','X2','X2','X2','X2','X3','X3','X3','X3','X3','X3','X4','X4','X4','X4','X4', 'X4','X5','X5','X5','X5','X5','X5','X1','X1', 'X1','X1', 'X1','X1','X2','X2','X2','X2','X2','X2','X3','X3','X3','X3','X3','X3','X1', 'X1','X1', 'X1','X1', 'X1'],
'Generator' : ['DE','NDE', 'GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4','DE','NDE', 'GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4', 'DE','NDE','GBX1','GBX2','GBX3','GBX4','DE', 'NDE','GBX1','GBX2','GBX3','GBX4', 'DE','NDE','GBX1','GBX2','GBX3','GBX4','DE','NDE','GBX1','GBX2','GBX3','GBX4']}
df1 = pd.DataFrame(df1, columns = ['Site', 'Building', 'Generator'])
df15 = {'Building' : ['X1','X2','X3','X4','X5','X1','X2','X3','X1'],
'Site': ['Belgium','Belgium','Belgium','Belgium','Belgium','Holland','Holland','Holland','Greece'],
'Safe' : [1, 1, 1, 1, 1, 0, 1, 1, 0]}
df2 = pd.DataFrame(df15, columns = ['Site', 'Building', 'Safe'])
df3 = df1.merge(df2, how = 'left', on = ['Site', 'Building'], indicator = True)
I've also tried changing the data types of each to string as per pandas - Merging on string columns not working (bug?),
df1['Site'] = df1['Site'].astype('str')
df1['Building']=df1['Building'].astype('str')
df1['Site'] = df1['Site'].astype('str')
df1['Building']=df1['Building'].astype('str')
and also the step that mentions checking encoding as per below, but all seem to match ie. there are no visible dependencies in content;
df1['Building'] = df1['Building'].str.encode('UTF-8')
df1['Site'] = df1['Site'].str.encode('UTF-8')
Datatypes:
df2.datatypes:
Site object
Building object
Safe object
dtype: object
df1.datatypes:
Building object
Site object
Generator object
dtype: object
I've tried the below code:
df3 = df1.merge(df2, left_on = ['Site', 'Building'], right_on = ['Site', 'Building'], how = 'left', indicator = 'indicator')
or:
df3 = df1.merge(df2, on = ['Site', 'Building'], how = 'left', indicator = 'indicator')
but the outcome ends up with only the data from left ie outcome 1.
I've tried outer join as per below, which yields outcome 2:
df3 = df1.merge(df2, on = ['Site', 'Building'], how = 'outer', indicator = 'indicator')
Apologies for my relative ignorance with respect to pandas.