I have two DataFrames with a DatetimeIndex. I am copying a column (or more) from df1 into df2, on matching indexes. See the example below.
Example
df1
some_col
datetime
2021-01-01 00:10:00 0.906362
2021-01-01 00:11:00 0.361477
2021-01-01 00:12:00 0.007753
2021-01-01 00:13:00 0.639844
2021-01-01 00:14:00 0.885964
... ...
2022-11-26 10:45:00 0.539110
2022-11-26 10:46:00 0.177116
2022-11-26 10:47:00 0.580149
2022-11-26 10:48:00 0.600307
2022-11-26 10:49:00 0.449839
[1000000 rows x 1 columns]
df2
to_copy
datetime
2021-01-01 00:00:00 6383.498664
2021-01-01 00:05:00 6252.822505
2021-01-01 00:10:00 5553.735088
2021-01-01 00:15:00 474.491263
2021-01-01 00:20:00 7758.722618
... ...
2022-11-26 18:35:00 542.154115
2022-11-26 18:40:00 2031.866841
2022-11-26 18:45:00 1828.248310
2022-11-26 18:50:00 2748.979463
2022-11-26 18:55:00 4629.085881
[200100 rows x 1 columns]
Final df
some_col to_copy
datetime
2021-01-01 00:10:00 0.906362 5553.735088
2021-01-01 00:11:00 0.361477 NaN
2021-01-01 00:12:00 0.007753 NaN
2021-01-01 00:13:00 0.639844 NaN
2021-01-01 00:14:00 0.885964 NaN
... ... ...
2022-11-26 10:45:00 0.539110 718.218622
2022-11-26 10:46:00 0.177116 NaN
2022-11-26 10:47:00 0.580149 NaN
2022-11-26 10:48:00 0.600307 NaN
2022-11-26 10:49:00 0.449839 NaN
[1000000 rows x 2 columns]
My code
The way I currently copy the column/columns is the following:
df1.loc[np.intersect1d(df2.index.values, df1.index.values), "to_copy"] = df2["to_copy"]
df1["to_copy"] = df1["to_copy"].shift(interval - 1)
The problem
This line of code is currently a performance bottleneck in my program. When profiling the code, I noticed the slowest part is pandas indexing in loc
. Is there a faster way of copying columns on matching indexes?
Recreate example
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
# Dfs generation
rows_df1 = 1000000
df2_interval = 5 # min
rows_df2 = 200100
df1 = pd.DataFrame(
{
"datetime": [datetime(2021, 1, 1, 0, 10) + timedelta(minutes=i) for i in range(rows_df1)],
"some_col": np.random.rand(rows_df1),
}
).set_index("datetime")
df2 = pd.DataFrame(
{
"datetime": [datetime(2021, 1, 1, 0, 0) + timedelta(minutes=i * df2_interval) for i in range(rows_df2)],
"to_copy": np.random.rand(rows_df2),
}
).set_index("datetime")
# Copying a column on matching indexes
intersection = np.intersect1d(df2.index.values, df1.index.values)
df1.loc[intersection, "to_copy"] = df2.loc[intersection, "to_copy"]
print(df1)