The join is not working. Sample data and code as follows.
Look up file:
helper time Loc FUEL Rep KM
0.1|A100|A 0.1 A100 100.00% -3.93 659
0.1|A200|A 0.1 A200 100.00% -4.49 628
0.22|A100|B 0.22 A100 90.00% -1.49 511
...
After importing look up file, did the following command to remove any spaces as there's was a keyerror before. I am guessing there might be some space issue within the columns.
dflookup.columns = dflookup.columns.to_series().apply(lambda x: x.strip())
Here's main file:
time user loc dist flightKM loc2 helper1
0.1 PilotL1 A100 A 140 A200 0.1|A200|A
0.22 PilotL2 B100 B 230 A100 0.22|A100|B
...
Expect output of main df
time user loc dist flightKM loc2 helper1 Rep2 FUEL2
0.1 PilotL1 A100 A 140 A200 0.1|A200|A -3.93 100%
0.22 PilotL2 B100 B 230 A100 0.22|A100|B -1.49 90%
...
Tried some of the solutions provided in SO. Haven't gotten a fix yet. Aim: to do a match using helper columns on Left, Right joins to add two columns (Rep, Fuel) from lookup into dfmain.
PROBLEM: would like some tips to solve the left, join issue as it's not finding all and correct values from lookup "Rep, FEUL" to dfmain. Open to a quick fix as well as tips to optimizing the code in anyway, as this is just a basic py script with possible adhoc operations.
code:
dfmain['Loc'] = dfmain['Loc'].str.replace(' ', '')
#creating a helper1 column in dfmain by concat columns as left,
right joins didnot allow a multi column in join operator
dfmain['helper1'] = dfmain[['time', 'loc2', 'dist']].apply(
lambda x: '|'.join(x.dropna().astype(str)),
axis=1
)
#search merge
dfmain = pd.merge(
left=dfmain,
right=dflookup[['helper', 'Rep', 'FUEL']],
left_on='helper1',
right_on='helper',
how='left')
#tidy up
dfmain.rename(columns={'Rep':'Rep2'}, inplace=True)
dfmain.rename(columns={'FUEL':'FUEL2'}, inplace=True)
big_df = big_df.drop(columns=['helper'])
For scrutiny sake:
print("minimum reproducible code and dataset")
dflookup = pd.DataFrame([('falcon', 'bird', 100),
('parrot', 'bird', 50),
('lion', 'mammal', 50),
('monkey', 'mammal', 100)],
columns=['type', 'class', 'years'],
index=[0, 2, 3, 1])
dfmain = pd.DataFrame([('Target','falcon', 'bird', 389.0),
('Shout','parrot', 'bird', 24.0),
('Roar','lion', 'mammal', 80.5),
('Jump','monkey','mammal', np.nan),
('Sing','parrot','bird', 72.0)],
columns=['name','type', 'class', 'max_speed'],
index=[0, 2, 3, 1, 2])