3

I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.

I come up with this code but I am not sure how robust it is:

dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left")  #one col left join
df['b'] = dtmp['b'].values

I know it certainly fails when there are duplicated keys: pandas left join - why more results?

Is there better way to do this?

Related:

Outer merging two data frames in place in pandas

What are the exact downsides of copy=False in DataFrame.merge()?

user3226167
  • 3,131
  • 2
  • 30
  • 34

1 Answers1

5

You are almost there. There are 4 cases to consider:

  1. Both df and df_right do not have duplicated keys
  2. Only df has duplicated keys
  3. Only df_right has duplicated keys
  4. Both df and df_right have duplicated keys

Your code fails in case 3 & 4 since the merging extends the number of row count in df. In order to make it work, you need to choose what information to drop in df_right prior to merging. The purpose of this is to enforce any merging scheme to be either case 1 or 2.

For example, if you wish to keep "first" values for each duplicated key in df_right, the following code works for all 4 cases above.

dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values

Alternatively, if column 'b' of df_right consists of numeric values and you wish to have summary statistic:

dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
Asclepius
  • 57,944
  • 17
  • 167
  • 143
gyoza
  • 2,112
  • 2
  • 12
  • 18
  • This is just what I needed, is there a way to only fill na's within a subset of columns from df_right to df? – Matt_Davis Mar 21 '20 at 12:28