I know there are many questions like this one but I can't seem to find the relevant answer. Let's say I have 2 data frames as follow:
df1 = pd.DataFrame(
{
"end": [
"2019-08-31",
"2019-08-28",
"2019-09-09",
"2019-09-08",
"2019-09-14",
"2019-09-14",
],
"start": [
"2019-08-27",
"2019-08-22",
"2019-08-04",
"2019-09-02",
"2019-09-06",
"2019-09-10",
],
"id": [1234, 8679, 8679, 1234, 1234, 8679],
}
)
df2 = pd.DataFrame(
{
"timestamp": [
"2019-08-30 10:00",
"2019-08-28 10:00",
"2019-08-27 10:30",
"2019-08-07 12:00",
"2019-09-12 10:00",
"2019-09-11 14:00",
"2019-08-29 18:00",
],
"id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
"val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
}
)
df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])
df2["timestamp"] = pd.to_datetime(df2["timestamp"])
df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)
Resulted as:
end start id
0 2019-08-31 2019-08-27 1234
1 2019-08-28 2019-08-22 8679
2 2019-09-09 2019-08-04 8679
3 2019-09-08 2019-09-02 1234
4 2019-09-14 2019-09-06 1234
5 2019-09-14 2019-09-10 8679
timestamp id val
0 2019-08-30 10:00 1234 AAAB
1 2019-08-28 10:00 1234 ABBA
2 2019-08-27 10:30 8679 CXXC
3 2019-08-07 12:00 1234 BBAA
4 2019-09-12 10:00 8679 XCXC
5 2019-09-11 14:00 8679 CCXX
6 2019-08-29 18:00 1234 BAAB
The classic way to merge by ID so timestamp will be between start and end in df1 is by merge on id or dummy variable and filter:
merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
(merged_df["timestamp"] >= merged_df["start"])
& (merged_df["timestamp"] <= merged_df["end"])
]
In which I get the output I wish to have:
end start id timestamp val
0 2019-08-31 2019-08-27 1234 2019-08-30 10:00 AAAB
1 2019-08-31 2019-08-27 1234 2019-08-28 10:00 ABBA
3 2019-08-31 2019-08-27 1234 2019-08-29 18:00 BAAB
4 2019-08-28 2019-08-22 8679 2019-08-27 10:30 CXXC
7 2019-09-09 2019-08-04 8679 2019-08-27 10:30 CXXC
19 2019-09-14 2019-09-10 8679 2019-09-12 10:00 XCXC
20 2019-09-14 2019-09-10 8679 2019-09-11 14:00 CCXX
My question: I need to do the same merge and get the same results but df1 is 200K rows and df2 is 600K.
What I have tried so far:
The classic way of merge and filter, as above, will fail because the initial merge will create a huge data frame that will overload the memory.
I also tried the pandasql approach which ended with my 16GB RAM PC
getting stuck.I tried the merge_asof in 3 steps of left join, right join and outer join as
explained here but I run some tests and it seems to always
return up to 2 records from df2 to a single line in df1.
Any good advice will be appreciated!