0

I am newbie to python and pandas. looking for help in subtraction between two different df column Sample data

df1 = pd.DataFrame({'label1': ['lab1', 'lab2', 'lab3', 'lab4'],
                   'compare1': [10, 20, 30, 40],
                   'col3': [100, 50, -30, -50]})

df2 = pd.DataFrame({'lable1': ['lab1', 'lab2', 'lab4', 'lab5'],
                   'compare1': [80,10, 40, 50],
                   'col3': [200, 60, 30, 50]})

df_diff = pd.DataFrame({'label': ['lab1', 'lab2', 'lab4', 'lab3', 'lab5'],
                        'df1_compare1': [10, 20, 40, 30, 0],
                        'df2_compare1': [80,10, 40, 0, 50],
                        'compare': [-70, 10, 0, 30, 50]})

Column name and count are same for both dfs.

Need help in subtraction (or other maths operation) between df1.compare1 and df2.compare2 if label1 and label2 has the same value.

for example, lab1, lab2 and lab 4(though lab 4 on different row) are having same label value. any rows which are not same in both df should show up at the bottom of the df with original value while other column data shows as 0.

Actual data are in CSV/excel MB in size, if possible looking for performance effective method

Does the non-null object has impact on performance?? This is the column type on label.

anno
  • 27
  • 7

1 Answers1

1

First merge your dataframes on the column label.

Finally create your compare column by substracting compare1-compare2:

df_diff = df1[['label1', 'compare1']].merge(df2[['label1', 'compare1']], 
                                            on='label1', 
                                            how='outer',
                                            suffixes=['_df1', '_df2'])


df_diff['compare'] = df_diff['compare1_df1'].fillna(0) - df_diff['compare1_df2'].fillna(0)

  label1  compare1_df1  compare1_df2  compare
0   lab1          10.0          80.0    -70.0
1   lab2          20.0          10.0     10.0
2   lab3          30.0           NaN     30.0
3   lab4          40.0          40.0      0.0
4   lab5           NaN          50.0    -50.0

If you have 20 min time, here's a good read on how merging works: Pandas Merging 101

Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Thanks Erfan, your solution did work as expected. what if my column names are same in both df ??I updated the data in the question – anno Jun 23 '19 at 12:42
  • That makes it easier, since we only have to use the `on` argument instead of `left_on` and `right_on`, and now we can make use of the `suffixes` argument to give `df1` and `df2` as suffix to our columns. See edited answer @anno – Erfan Jun 23 '19 at 12:49
  • Thanks Erfan. it is working as expected. any suggestion on performance ?? – anno Jun 23 '19 at 13:20
  • Do you have any issues with performance? Speedwise this should be quite efficient. Memorywise the data can get quite big, since we are doing an `outer` join to keep all the rows from both dataframes. But thats what we have to use, since thats the relationship you are defining in your question. @anno – Erfan Jun 23 '19 at 13:24
  • Don't forget to accept as answer if this helped you :) @ann – Erfan Jun 23 '19 at 14:17