6

I have 2 pandas data frames - df_current_data, df_new_data.

my goal is to apply a merge (not a pandas merge function, merge like 'update\insert'). The check for a match is by key columns.

my result need to built by 3 optional rows-types.

  1. rows which exists in df_current_data but not exists in df_new_data - will insert "as is" to the result.

  2. rows which exists in df_new_data but not exists in df_current_data - will insert "as is" to the result.

  3. rows which exists in df_new_data and exists in df_current_data - the result need to take the rows from df_new_data.

This is a classic merge- upsert action.

example:

# rows 0,1 are in current and not in new (check by index1 and index2)
# row 2 is common
In [41]: df_current_source
Out[41]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  3       3       6

# rows 0,2 are in new and not in current (check by index1 and index2)
# row 1 is common
In [42]: df_new_source
Out[42]:    A  index1  index2
         0  4       2       7
         1  5       3       6
         2  6       4       5

# the result has 2 rows that only in current (rows 0,1)
# the result has 2 rows that only in new (rows 3,4)
# the result has one row that exists in both current and new (row 2 - index1 = 3, index2 = 6) - so the value of the column A is from the new and not from the current (5 instead of 2)

In [43]: df_result
Out[43]:    A  index1  index2
         0  1       1       4
         1  2       2       5
         2  5       3       6
         3  4       2       7
         4  6       4       5

That's what I did:

# left join from source to new
df = df_current_source.merge(df_new_source, how='left', left_on=p_new_keys, 
right_on=p_curr_keys, indicator=True)

# take only the rows that exists in the current and not exists in the source
df_only_current = df[df['_merge'] == 'left_only']

# merge new data into current data
df_result = pd.concat([df_only_current, df_new_source])

another option is with isin function:

df_result = pd.concat([df_current_source[~df_current_source[p_key_col_name]\

.isin(df_new_source[p_key_col_name])], df_new_source])

The problem is that if I have more than 1 key-column I can't use the isin, I need merge.

Assuming that the current is much bigger from the new, I guess the best way is to directly update the matching rows of current by the rows of new, and append the new rows of the "new" data frame into the current.

But I'm not sure how to do that..

Thanks a lot.

ejohnso49
  • 1,336
  • 1
  • 15
  • 20
user2671057
  • 1,411
  • 2
  • 25
  • 43
  • How are you updating by? Index? – cs95 Aug 21 '17 at 11:34
  • 1
    Also, please provide some input and expected output. – cs95 Aug 21 '17 at 11:36
  • What you mean updating? I have 2 columns that need to match. I added input and expected output. – user2671057 Aug 21 '17 at 12:18
  • In `df_result`, the row corresponding to A=3 is missing. – cs95 Aug 21 '17 at 12:20
  • It's not missing. this is what merge does - the "keys" of current and new are same .index1 = 3, index2 = 6. there is row with those values in both new and current. That's mean that this row exists, but we want to update the content of the row. so we take the value of this row from the new df (A=5) and put it instead of the current df (A=2). thats what I want that will be. – user2671057 Aug 21 '17 at 13:00

2 Answers2

1

Option 1: use indicator=True as part of merge:

df_out = df_current_source.merge(df_new_source, 
                                 on=['index1', 'index2'], 
                                 how='outer', indicator=True)

df_out['A'] = np.where(df_out['_merge'] == 'both',
                       df_out['A_y'],
                       df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int)

df_out[['A', 'index1', 'index2']]

Output:

   A  index1  index2
0  1       1       4
1  2       2       5
2  5       3       6
3  4       2       7
4  6       4       5

Option 2: use combined_first with set_index

df_new_source.set_index(['index1', 'index2'])\
             .combine_first(df_current_source.set_index(['index1', 'index2']))\
             .reset_index()\
             .astype(int)

Output:

   index1  index2  A
0       1       4  1
1       2       5  2
2       2       7  4
3       3       6  5
4       4       5  6
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • can you please explain this part : df_out.A_x.add(df_out.A_y, fill_value=0)).astype(int) ? thanks – user2671057 Aug 21 '17 at 13:42
  • Sure. You can use merge with indicator=True, which labels the records to which dataframe the join happened on since you choose 'outer' it could be the data is from the left, right or both. Then, you can use np.where which acts like an if statement. If _merge the labelling column equals both then get the value from the new frame, else get the the value from the addition of left and right where the missing value is replace with 0. – Scott Boston Aug 21 '17 at 13:45
  • I am basically creating a new column thta combines A_x and A_y, by adding them together since one is always NaN, I replace the NaN with 0. Just a simple trick of getting the not-null value from either column. – Scott Boston Aug 21 '17 at 13:47
1

Check this link join or merge with overwrite in pandas. You can use combine_first:

combined_dataframe = df_new_source.set_index('A').combine_first(df_current_source.set_index('A'))
combined_dataframe.reset_index()

Output

    A  index1  index2
 0  1   1.0    4.0
 1  2   2.0    5.0
 2  3   2.0    7.0
 3  5   3.0    6.0
 4  6   4.0    5.0
Vico
  • 579
  • 3
  • 13