
I have two pandas dataframes with datetime.date indexes. Some index values may appear in both. I want to build df3 from all rows of df1, plus only those rows of df2 whose index is not already in df1. df3 must keep the date indexes.

Example dataframes:

import numpy as np
import pandas as pd

np.random.seed(100)

dates1 = pd.date_range('2020-01-01', periods=8)
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], index=dates1)
print('df1:')
print(df1)

df1:
                   A
2020-01-01 -1.749765
2020-01-02  0.342680
2020-01-03  1.153036
2020-01-04 -0.252436
2020-01-05  0.981321
2020-01-06  0.514219
2020-01-07  0.221180
2020-01-08 -1.070043

dates2 = pd.date_range('2020-01-06', periods=5)
df2 = pd.DataFrame(np.random.randn(5, 1), columns=['A'], index=dates2)
print('\ndf2:')
print(df2)

df2:
                   A
2020-01-06 -0.189496
2020-01-07  0.255001
2020-01-08 -0.458027
2020-01-09  0.435163
2020-01-10 -0.583595

I can solve this by iterating (very slow), and also by list comprehension (much faster), but I don't see how to do it through vectorization. According to @cs95's answer to this question, the methods ranked fastest to slowest are:

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i) Reductions that can be performed in cython, ii) Iteration in python space
  5. DataFrame.itertuples() and iteritems()
  6. DataFrame.iterrows()

The code to solve this using list comprehension is:

# Solution with list comprehension
new_rows = [x not in df1.index.values for x in df2.index.values]
df3 = df1.append(df2.loc[new_rows], verify_integrity=True)
print('\ndf3:')
print(df3)
df3:
                   A
2020-01-01 -1.749765
2020-01-02  0.342680
2020-01-03  1.153036
2020-01-04 -0.252436
2020-01-05  0.981321
2020-01-06  0.514219
2020-01-07  0.221180
2020-01-08 -1.070043
2020-01-09  0.435163
2020-01-10 -0.583595

As you can see, df3 prefers df1's values over df2's, and falls back to df2's values only for dates absent from df1 (this is the expected behavior).

Is there a way to do this with vectorization? Thanks!
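For reference, the membership test itself can be expressed in vectorized form with `pandas.Index.isin` instead of a Python-level comprehension; a minimal sketch on the same example data (using `pd.concat`, which also accepts `verify_integrity`):

```python
import numpy as np
import pandas as pd

np.random.seed(100)

dates1 = pd.date_range('2020-01-01', periods=8)
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], index=dates1)

dates2 = pd.date_range('2020-01-06', periods=5)
df2 = pd.DataFrame(np.random.randn(5, 1), columns=['A'], index=dates2)

# Vectorized membership test: boolean mask of df2 dates not already in df1
mask = ~df2.index.isin(df1.index)

# Keep all of df1, then append only the non-overlapping rows of df2
df3 = pd.concat([df1, df2.loc[mask]], verify_integrity=True)
print(df3)
```

Whether this beats the list comprehension at these tiny sizes would need timing; the mask construction is the only part that changes.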

  • something with `combine_first`? – Quang Hoang Mar 29 '20 at 14:16
  • 2
    This is combine_first `df1.combine_first(df2)` – Erfan Mar 29 '20 at 14:17
  • Hi. Turns out that for this particular case, using combine_first is slower than my proposed solution. ```%timeit df3 = df1.append(df2.loc[[x not in df1.index.values for x in df2.index.values]], verify_integrity=True)``` shows ```1.39 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)```. ```%timeit df3 = df1.combine_first(df2)``` shows ```2.36 ms ± 39.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)```. I was looking for a faster solution, not a slower. Even though I appreciate your help, I don't think the solution is vectorized, as it should be in the microseconds. – xiaxio Mar 29 '20 at 14:57
