6

I have two DataFrames and want to use the second one only on the rows whose index is not already contained in the first one.

What is the most efficient way to do this?

Example:

df_1
idx     val
0      0.32
1      0.54
4      0.26
5      0.76
7      0.23

df_2
idx     val
1     10.24
2     10.90
3     10.66
4     10.25
6     10.13
7     10.52

df_final
idx     val
0      0.32
1      0.54
2     10.90
3     10.66
4      0.26
5      0.76
6     10.13
7      0.23

Recap: I need to add the rows in df_2 for which the index is not already in df_1.


EDIT

Removed some indices in df_2 to illustrate the fact that all indices from df_1 are not covered in df_2.

Jivan
  • 21,522
  • 15
  • 80
  • 131

2 Answers2

4

You can use reindex with combine_first or fillna:

df = df_1.reindex(df_2.index).combine_first(df_2)
print (df)
       val
idx       
0     0.32
1     0.54
2    10.90
3    10.66
4     0.26
5     0.76
6    10.13
7     0.23

df = df_1.reindex(df_2.index).fillna(df_2)
print (df)
       val
idx       
0     0.32
1     0.54
2    10.90
3    10.66
4     0.26
5     0.76
6    10.13
7     0.23
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I didn't make it obvious in my example, but the `df_2` index does not cover all the `df_1` index. Hence this method does not work as it is, it takes only the indices which appear in `df_2`. – Jivan Feb 20 '17 at 13:30
  • I try test it, for me it works. Can you add it to samples where is problem in my solution? Thanks. – jezrael Feb 20 '17 at 13:34
  • I have `len(df) = len(df_2)` when running either of your solutions. The indices which are in `df_1` but not in `df_2` are removed. – Jivan Feb 20 '17 at 13:36
  • I cannot simulate it. I try same `length` and for me it works. I dont know what is problem :( – jezrael Feb 20 '17 at 13:45
  • 2
    @Jivan You can achieve the wanted result by using `df_final = df_1.combine_first(df_2)`. – sodd Feb 20 '17 at 13:53
2

You can achieve the wanted output by using the combine_first method of the DataFrame. From the documentation of the method:

Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns

Example usage:

import pandas as pd

df_1 = pd.DataFrame([0.32,0.54,0.26,0.76,0.23], columns=['val'], index=[0,1,4,5,7])
df_1.index.name = 'idx'

df_2 = pd.DataFrame([10.56,10.24,10.90,10.66,10.25,10.13,10.52], columns=['val'], index=[0,1,2,3,4,6,7])
df_2.index.name = 'idx'

df_final = df_1.combine_first(df_2)

This will give the desired result:

In [7]: df_final
Out[7]:
       val
idx       
0     0.32
1     0.54
2    10.90
3    10.66
4     0.26
5     0.76
6    10.13
7     0.23
sodd
  • 12,482
  • 3
  • 54
  • 62