
I have 2 dataframes, left_df and right_df, which both have 20 columns with identical names and dtypes. right_df also has 2 additional columns with unique values in every row.

I want to update rows in right_df with ALL the values from left_df wherever the values in ALL columns of a subset list, matching_cols = ['col_1', 'col_3', 'col_10', 'col_12'], are identical in both dataframes. The values in the 2 additional unique columns in right_df should be preserved.

Ideally, I also want to drop those rows from left_df in the same command, or in the next command if that isn't possible. I need to repeat this process more than once, matching on several different lists of columns, with left_df dropping its matched rows on each pass, until eventually no further matches are found.

An acceptable alternative would be any method to create a new dataframe new_df containing the set of rows where all specified columns in the list matching_cols match, with values from left_df in the first 20 columns and values from right_df in the remaining 2 columns.

I don't care about preserving the indices at any point in either dataframe; I am importing them to SQL after this and will reindex on one of the 2 unique right_df columns at the end.

I'm new to Pandas and can't determine which method to use. I have tried variations of .merge, .join, .update, etc., but can't seem to specify that an update should only happen when my desired column values all match, or work out how to drop those rows / export them to a new df.

Update: Added pseudocode below:

For a left_df as:

left_df = pd.DataFrame({
   'col_0': ['0', '1', '2', '3', '4', '5'],
   'col_1': ['A', 'B', 'C', 'D', 'E', 'F'],
   'col_2': ['new', 'new', 'new', 'new', 'new', 'new'],
   'col_3': ['new', 'new', 'new', 'new', 'new', 'new'],
   'col_4': ['new', 'new', 'new', 'new', 'new', 'new'],
   'col_5': ['new', 'new', 'new', 'new', 'new', 'new'],
   'col_6': ['new', 'new', 'new', 'new', 'new', 'new'],
   'col_7': ['new', 'new', 'new', 'new', 'new', 'new'],                
  })

and a right_df as:

right_df = pd.DataFrame({
   'col_0': ['0', '1', '2', '3', '4', '5'],
   'col_1': ['A', 'B', 'C', 'X', 'E', 'F'],
   'col_2': ['old', 'old', 'old', 'old', 'old', 'old'],
   'col_3': ['old', 'old', 'old', 'old', 'old', 'old'],
   'col_4': ['old', 'old', 'old', 'old', 'old', 'old'],
   'col_5': ['old', 'old', 'old', 'old', 'old', 'old'],
   'col_6': ['old', 'old', 'old', 'old', 'old', 'old'],
   'col_7': ['old', 'old', 'old', 'old', 'old', 'old'],
   'col_8': ['uid_0', 'uid_1', 'uid_2', 'uid_3', 'uid_4', 'uid_5'],
   'col_9': ['uid_a', 'uid_b', 'uid_c', 'uid_d', 'uid_e', 'uid_f'],                
  })

Where matching_cols = ['col_0', 'col_1']

I want to get the following result, either as a new dataframe or in-place on right_df (note that col_1 doesn't match on row 3, so that row is not changed):

  col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7  col_8  col_9
0     0     A   new   new   new   new   new   new  uid_0  uid_a
1     1     B   new   new   new   new   new   new  uid_1  uid_b
2     2     C   new   new   new   new   new   new  uid_2  uid_c
3     3     X   old   old   old   old   old   old  uid_3  uid_d
4     4     E   new   new   new   new   new   new  uid_4  uid_e
5     5     F   new   new   new   new   new   new  uid_5  uid_f

2 Answers


Try this:

new_df=pd.concat([left_df,right_df.iloc[:,-1:-3]],axis=1)

  • in this, what's `.iloc[:,-1:-3]` referring to? – Mansard Aug 10 '20 at 20:36
  • update: tried it, it works but this omits the 2 columns I need from `right_df` to be included in the result. – Mansard Aug 10 '20 at 20:38
  • ```.iloc[:,-1:-3]``` is positional indexing of the columns in the dataframe. If you need specific columns you can also try ```.loc[:,['list of columns separated by commas in quotes']]```; this should solve everything – shyam_gupta Aug 10 '20 at 21:11
  • I just now saw the edit you made, and I think the above answer (`.loc`) will still work, since it's the only correct way I can think of to concatenate the data. You have to specify the column names correctly in `.loc`. Also, do try to remove the occurrence of `col_8` appearing twice in the data – shyam_gupta Aug 10 '20 at 21:15
  • ```new_df=pd.concat([left_df,right_df.loc[:,['col_0','col_1','col_8','col_9']],axis=1)``` – shyam_gupta Aug 10 '20 at 21:21
  • Sorry, even after adding a missing bracket in what I THINK is the right spot, this yields a syntax error: File "", line 1 new_df=pd.concat([left_df,right_df.loc[:,['col_0','col_1','col_8','col_9']],axis=1]) ^ SyntaxError: invalid syntax – Mansard Aug 11 '20 at 02:12
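For reference, the SyntaxError in the last comment comes from the bracket placement: the list passed to pd.concat must be closed before the axis keyword argument. A minimal corrected sketch (note that this simply concatenates column-wise by index rather than matching on matching_cols, and it selects only the two columns unique to right_df to avoid duplicating col_0/col_1):

# Close the concat list before the keyword argument; take only the two
# columns that are unique to right_df so col_0/col_1 are not duplicated.
new_df = pd.concat([left_df, right_df.loc[:, ['col_8', 'col_9']]], axis=1)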

Worked it out thanks to this post and the Pandas documentation:

First, it's a .merge I need, setting the suffixes so that '_r' is appended only to the columns copied from right_df, i.e. the old values I'm updating:

merged_df = pd.merge(left_df, right_df, on=['col_0', 'col_1'], suffixes=(None, '_r'))

This yields a new dataframe whose rows contain both the new and the old columns, but only for rows where the values in the on=['col_0', 'col_1'] columns match in both dataframes. Then I drop the "old" columns using a regex filter on the '_r' suffix:

merged_df.drop(list(merged_df.filter(regex='_r')), axis=1, inplace=True)

This yields a dataframe with only the "modified" rows and no unmodified rows, which is close enough for what I need.

  col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7  col_8  col_9
0     0     A   new   new   new   new   new   new  uid_0  uid_a
1     1     B   new   new   new   new   new   new  uid_1  uid_b
2     2     C   new   new   new   new   new   new  uid_2  uid_c
3     4     E   new   new   new   new   new   new  uid_4  uid_e
4     5     F   new   new   new   new   new   new  uid_5  uid_f
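The question also asked about dropping the matched rows from left_df and repeating the process with different lists of matching columns. Here is a minimal sketch of one way to do that, building on the merge above; the helper name update_and_drop and the example column lists are hypothetical, and it assumes the matching columns exist in both dataframes:

import pandas as pd

def update_and_drop(left_df, right_df, matching_cols):
    # Inner merge keeps only rows where every column in matching_cols agrees;
    # left_df columns keep their names, the overlapping right_df copies get '_r'.
    merged = pd.merge(left_df, right_df, on=matching_cols, suffixes=(None, '_r'))
    merged = merged.drop(list(merged.filter(regex='_r')), axis=1)

    # Anti-join: keep only the left_df rows whose matching_cols values were NOT matched.
    flags = left_df.merge(right_df[matching_cols].drop_duplicates(),
                          on=matching_cols, how='left', indicator=True)
    remaining_left = flags[flags['_merge'] == 'left_only'].drop(columns='_merge')
    return merged, remaining_left

# Example loop over several (hypothetical) lists of matching columns;
# left_df shrinks on each pass until no further matches are found.
results = []
for cols in [['col_0', 'col_1'], ['col_1', 'col_3']]:
    new_df, left_df = update_and_drop(left_df, right_df, cols)
    if new_df.empty:
        break
    results.append(new_df)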