Pandas merge and join not picking up correct values

Question

The join is not working. Sample data and code as follows.

Look up file:

     helper    time  Loc   FUEL        Rep      KM
 0.1|A100|A    0.1   A100  100.00%     -3.93     659
 0.1|A200|A    0.1   A200  100.00%     -4.49     628
 0.22|A100|B   0.22  A100  90.00%      -1.49     511
 ...

After importing look up file, did the following command to remove any spaces as there's was a keyerror before. I am guessing there might be some space issue within the columns.

dflookup.columns = dflookup.columns.to_series().apply(lambda x: x.strip())

Here's main file:

time     user        loc    dist   flightKM  loc2      helper1       
0.1      PilotL1     A100   A      140       A200      0.1|A200|A  
0.22     PilotL2     B100   B      230       A100      0.22|A100|B 
...

Expect output of main df

time    user      loc    dist   flightKM  loc2   helper1      Rep2    FUEL2    
0.1     PilotL1   A100   A      140       A200   0.1|A200|A   -3.93  100%
0.22    PilotL2   B100   B      230       A100   0.22|A100|B  -1.49  90%
...

Tried some of the solutions provided in SO. Haven't gotten a fix yet. Aim: to do a match using helper columns on Left, Right joins to add two columns (Rep, Fuel) from lookup into dfmain.

PROBLEM: would like some tips to solve the left, join issue as it's not finding all and correct values from lookup "Rep, FEUL" to dfmain. Open to a quick fix as well as tips to optimizing the code in anyway, as this is just a basic py script with possible adhoc operations.

code:

    dfmain['Loc'] = dfmain['Loc'].str.replace(' ', '')
    #creating a helper1 column in dfmain by concat columns as left, 
    right joins didnot allow a multi column in join operator
    
    dfmain['helper1'] = dfmain[['time', 'loc2', 'dist']].apply(
            lambda x: '|'.join(x.dropna().astype(str)),
            axis=1
        )
    
    #search merge
    dfmain = pd.merge(
    left=dfmain,
    right=dflookup[['helper', 'Rep', 'FUEL']],
    left_on='helper1',
    right_on='helper',
    how='left')

#tidy up
dfmain.rename(columns={'Rep':'Rep2'}, inplace=True)
dfmain.rename(columns={'FUEL':'FUEL2'}, inplace=True)
big_df = big_df.drop(columns=['helper'])

For scrutiny sake:

print("minimum reproducible code and dataset")

dflookup = pd.DataFrame([('falcon', 'bird', 100),
                          ('parrot', 'bird', 50),
                          ('lion', 'mammal', 50),
                          ('monkey', 'mammal', 100)],
                          columns=['type', 'class', 'years'],
                          index=[0, 2, 3, 1])


dfmain = pd.DataFrame([('Target','falcon', 'bird', 389.0),
                          ('Shout','parrot', 'bird', 24.0),
                          ('Roar','lion', 'mammal', 80.5),
                          ('Jump','monkey','mammal', np.nan),
                          ('Sing','parrot','bird', 72.0)],
                          columns=['name','type', 'class', 'max_speed'],
                          index=[0, 2, 3, 1, 2])

I don't understand what's wrong? – Corralien Feb 17 '23 at 10:41 — Corralien, Feb 17 '23 at 10:41

score 0 · Answer 1 · answered Feb 18 '23 at 15:49

0

Found a/the solution for now. Posting if anyone else at the beginning of python may face this.

Issue: apparently dist column has a single space. That made the helper1 join identifier to return false.

Worst was, for unknown/unsure reasons following command didn't do anything to remove this space from the raw column dist.

dfmain.columns = dfmain.columns.to_series().apply(lambda x: x.strip())

So, had to perform column exclusive space removal using below command before creating the helper1.

dfmain['dist'] = dfmain['dist'].str.replace(' ', '')

If anybody knows why first line using .apply(lambda)... didn't remove spaces as expected, please let me know here either as an answer or a comment.

answered Feb 18 '23 at 15:49

bonCodigo

14,268
1
48
91

1

Hi, with the first line, you try to remove white space characters from `dfmain` column headers (name, type, class, max_speed), not from the values in each column of `dfmain`. To do so, check this post: https://stackoverflow.com/a/45355563/11246056. Cheers. – Laurent Feb 19 '23 at 07:02
@Laurent oh that's very helpful. On another note, do you suggest a better code snippet to replace `apply()` in lambda using another method? Secondly, I saw a github bug about failure to use multi columns in pandas join operator. That's why I created a helper column thought it feels rather brute force in an era of lambda... Any thoughts on that? – bonCodigo Feb 19 '23 at 16:19
1

1/ As an alternative to apply, you can remove whitespace characters with replace: `dfmain = dfmain.replace(r"\s", "", regex=True)`. 2/ Not sure why you need to resort to helper columns in order to join the dataframes, with your MRE (which is not what you show at the beginning of your post, btw), the following expression works fine: `pd.merge(right=dfmain, left=dflookup, how="right", right_on=["type", "class"], left_on=["type", "class"])`. But perhaps I'm missing something. Cheers. – Laurent Feb 20 '23 at 13:20

Pandas merge and join not picking up correct values

1 Answers1

Linked