I have two csv files that I want to merge and retain all columns. I want to match on the common fields 'CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'SITE', 'PADD'
. The most important field in the left csv is 'QUANTITY'
and the most important fields in the right csv are Latitude,Longitude
. I have tried using pandas.merge()
with how='outer'
, how='left'
, and how='right'
, but each option winds up giving me NaN
values for at least some of the fields I need. I have also tried using pandas.concat()
, but the values I need to match are not all in the same order, and that just appended the files in the order they were listed, so I had mismatching state names etc.
The files look like the following
refcap22_trimmed.csv:
CORPORATION,SURVEY,PERIOD,COMPANY_NAME,RDIST_LABEL,STATE_NAME,SITE,PADD,PRODUCT,SUPPLY,QUANTITY
DELEK GROUP LTD,820,22,ALON REFINING KROTZ SPRINGS INC,Louisiana Gulf Coast,Louisiana,KROTZ SPRINGS,3,TOTAL OPERABLE CAPACITY,Atmospheric Crude Distillation Capacity (barrels per stream day),83000
...
Petroleum_Refineries2019EIA.csv:
X,Y,FID,COMPANY_NAME,CORPORATION,SITE,STATE_NAME,PADD,AD_Mbpd,VDist_Mbpd,CaDis_Mbpd,HyCrk_Mbpd,VRedu_Mbpd,CaRef_Mbpd,Isal_Mbpd,HDS_Mbpd,Cokin_Mbpd,Asph_Mbpd,Source,Period_,Latitude,Longitude,Link_Imports,Process_Sum
-91.748001,30.526508,105,ALON REFINING KROTZ SPRINGS INC,DELEK GROUP LTD,Krotz Springs,Louisiana,3,83,36.2,34,,,13,12.2,32,,,EIA-820 Refinery Capacity Report,As of Jan. 1 2021,30.526508,-91.748001,1,210.4
...
The desired output would be something like: output.csv:
CORPORATION,SURVEY,PERIOD,COMPANY_NAME,RDIST_LABEL,STATE_NAME,SITE,PADD,PRODUCT,SUPPLY,QUANTITY,X,Y,FID,AD_Mbpd,VDist_Mbpd,CaDis_Mbpd,HyCrk_Mbpd,VRedu_Mbpd,CaRef_Mbpd,Isal_Mbpd,HDS_Mbpd,Cokin_Mbpd,Asph_Mbpd,Source,Period_,Latitude,Longitude,Link_Imports,Process_Sum
DELEK GROUP LTD,820,22,ALON REFINING KROTZ SPRINGS INC,Louisiana Gulf Coast,Louisiana,KROTZ SPRINGS,3,TOTAL OPERABLE CAPACITY,Atmospheric Crude Distillation Capacity (barrels per stream day),83000,-93.88847,29.96046,1,TOTAL SA,3,245,117,80,,,43,15.3,274.5,60,,EIA-820 Refinery Capacity Report,As of Jan. 1 2021,29.96046,-93.88847,1,834.8
...
I have tried the code below:
import pandas as pd
refcap = pd.read_csv('../refcap22_trimmed.csv')
eia2019 = pd.read_csv('../Petroleum_Refineries2019EIA.csv')
output = pd.merge(refcap, eia2019, on=['CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'SITE', 'PADD'], how='right')
print(output.iloc[1])
and got the following result:
CORPORATION DELEK GROUP LTD
SURVEY 820.0
PERIOD 22.0
COMPANY_NAME ALON USA ENERGY INC
RDIST_LABEL Texas Inland
STATE_NAME Texas
SITE BIG SPRING
PADD 3.0
PRODUCT TOTAL OPERABLE CAPACITY
SUPPLY Atmospheric Crude Distillation Capacity (barre...
QUANTITY 74000.0
X NaN
Y NaN
FID NaN
AD_Mbpd NaN
VDist_Mbpd NaN
CaDis_Mbpd NaN
HyCrk_Mbpd NaN
VRedu_Mbpd NaN
CaRef_Mbpd NaN
Isal_Mbpd NaN
HDS_Mbpd NaN
Cokin_Mbpd NaN
Asph_Mbpd NaN
Source NaN
Period_ NaN
Latitude NaN
Longitude NaN
Link_Imports NaN
Process_Sum NaN
Name: 1, dtype: object
But the result I need would look something like:
CORPORATION DELEK GROUP LTD
SURVEY 820.0
PERIOD 22.0
COMPANY_NAME ALON USA ENERGY INC
RDIST_LABEL Texas Inland
STATE_NAME Texas
SITE BIG SPRING
PADD 3.0
PRODUCT TOTAL OPERABLE CAPACITY
SUPPLY Atmospheric Crude Distillation Capacity (barre...
QUANTITY 74000.0
X -91.748001
Y 30.526508
FID 105
AD_Mbpd 83
VDist_Mbpd 36.2
CaDis_Mbpd 34
HyCrk_Mbpd NaN
VRedu_Mbpd NaN
CaRef_Mbpd 13
Isal_Mbpd 12.2
HDS_Mbpd 32
Cokin_Mbpd NaN
Asph_Mbpd NaN
Source EIA-820 RefCap Report
Period_ As of Jan. 1 2021
Latitude 30.526508
Longitude -91.748001
Link_Imports 1
Process_Sum NaN
Name: 1, dtype: object
How do I accomplish the desired output?