I am trying to merge two large data frames based on two common columns, plus a condition on the year. There is a small attempt and some debate here, but no promising solution. The rows should be matched under these conditions:
df1.year <= df2.year (the df2 item is manufactured in the same or a later year)
df1.make = df2.make AND df1.location = df2.location
I prepared a small mock data set to explain:
first data frame:
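import numpy as np
import pandas as pd

# note: a mixed NumPy array stores every value (year and "NaN" included) as a string
data = np.array([[2014, "toyota", "california", "corolla"],
                 [2015, "honda", "california", "civic"],
                 [2020, "hyndai", "florida", "accent"],
                 [2017, "nissan", "NaN", "sentra"]])
df = pd.DataFrame(data, columns=['year', 'make', 'location', 'model'])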
df
second data frame:
data2 = np.array([[2012, "toyota", "california", "airbag"],
                  [2017, "toyota", "california", "wheel"],
                  [2022, "hyndai", "newyork", "seat"],
                  [2017, "nissan", "london", "light"]])
df2 = pd.DataFrame(data2, columns=['year', 'make', 'location', 'id'])
df2
desired output:
data3 = np.array([[2017,"toyota",'corolla',"california", "wheel"]])
df3 = pd.DataFrame(data3, columns = ['year', 'make','model','location','id'])
df3
I tried the approach below, but it is too slow and also not very accurate:
df4 = pd.merge(df, df2, on=['location', 'make'], how='outer')
df4 = df4.dropna()
df4['year'] = df4.apply(lambda x: x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)
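For reference, here is a minimal sketch of the result I am after. It uses an inner merge on make and location, followed by a vectorized filter on the year instead of a row-wise apply. It assumes the mock data is built from plain lists so that year stays an integer and the missing location is a real NaN. The names merged and result are only illustrative. I have not checked how well this scales to the real, large data frames, since the merge still produces every make/location pair before filtering.

```python
import numpy as np
import pandas as pd

# Mock data from the question, built from lists (assumption) so `year` stays numeric
df = pd.DataFrame(
    [[2014, "toyota", "california", "corolla"],
     [2015, "honda", "california", "civic"],
     [2020, "hyndai", "florida", "accent"],
     [2017, "nissan", np.nan, "sentra"]],
    columns=["year", "make", "location", "model"],
)
df2 = pd.DataFrame(
    [[2012, "toyota", "california", "airbag"],
     [2017, "toyota", "california", "wheel"],
     [2022, "hyndai", "newyork", "seat"],
     [2017, "nissan", "london", "light"]],
    columns=["year", "make", "location", "id"],
)

# Inner merge keeps only rows whose make and location match in both frames
merged = df.merge(df2, on=["make", "location"], suffixes=("_x", "_y"))

# Vectorized filter: the df2 year must be the same as or later than the df1 year
keep = merged["year_y"] >= merged["year_x"]
result = merged.loc[keep, ["year_y", "make", "model", "location", "id"]]
result = result.rename(columns={"year_y": "year"}).reset_index(drop=True)
print(result)
```

On the mock data this leaves the single toyota/california row with id "wheel", which matches the desired output above.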