I have two pandas dataframes with datetime.date indexes. Some indexes might duplicate. I want to use all rows in df1, and only the rows in df2 that are not in df1, to create df3. df3 must keep the date indexes.
Example dataframes:
import numpy as np
import pandas as pd
np.random.seed(100)
dates1 = pd.date_range('2020-01-01', periods=8)
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], index=dates1)
print('df1:')
print(df1)
df1:
A
2020-01-01 -1.749765
2020-01-02 0.342680
2020-01-03 1.153036
2020-01-04 -0.252436
2020-01-05 0.981321
2020-01-06 0.514219
2020-01-07 0.221180
2020-01-08 -1.070043
dates2 = pd.date_range('2020-01-06', periods=5)
df2 = pd.DataFrame(np.random.randn(5, 1), columns=['A'], index=dates2)
print('\ndf2:')
print(df2)
df2:
A
2020-01-06 -0.189496
2020-01-07 0.255001
2020-01-08 -0.458027
2020-01-09 0.435163
2020-01-10 -0.583595
I can solve this by iterating (absolutely slow), and also by list comprehension (way faster), but I don't see how to do it through vectorization. According to @cs95's answer to this question, the fastest methods are:
- Vectorization
- Cython routines
- List Comprehensions (vanilla for loop)
- DataFrame.apply(): i) Reductions that can be performed in cython, ii) Iteration in python space
- DataFrame.itertuples() and iteritems()
- DataFrame.iterrows()
The code to solve this using list comprehension is:
# Solution with list comprehension
new_rows = [x not in df1.index.values for x in df2.index.values]
df3 = df1.append(df2.loc[new_rows], verify_integrity=True)
print('\ndf3:')
print(df3)
df3:
A
2020-01-01 -1.749765
2020-01-02 0.342680
2020-01-03 1.153036
2020-01-04 -0.252436
2020-01-05 0.981321
2020-01-06 0.514219
2020-01-07 0.221180
2020-01-08 -1.070043
2020-01-09 0.435163
2020-01-10 -0.583595
As you can see, df3 takes df1 values over df2, but takes df2 values when there are none in df1 (this is the expected behavior).
Is there a way to do this with vectorization? Thanks!