
I know there are many questions like this one, but I can't seem to find the relevant answer. Let's say I have two data frames as follows:

import pandas as pd

df1 = pd.DataFrame(
    {
        "end": [
            "2019-08-31",
            "2019-08-28",
            "2019-09-09",
            "2019-09-08",
            "2019-09-14",
            "2019-09-14",
        ],
        "start": [
            "2019-08-27",
            "2019-08-22",
            "2019-08-04",
            "2019-09-02",
            "2019-09-06",
            "2019-09-10",
        ],
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)

df2 = pd.DataFrame(
    {
        "timestamp": [
            "2019-08-30 10:00",
            "2019-08-28 10:00",
            "2019-08-27 10:30",
            "2019-08-07 12:00",
            "2019-09-12 10:00",
            "2019-09-11 14:00",
            "2019-08-29 18:00",
        ],
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)

df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])

df2["timestamp"] = pd.to_datetime(df2["timestamp"])

df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)

Resulting in:

          end       start    id
0  2019-08-31  2019-08-27  1234
1  2019-08-28  2019-08-22  8679
2  2019-09-09  2019-08-04  8679
3  2019-09-08  2019-09-02  1234
4  2019-09-14  2019-09-06  1234
5  2019-09-14  2019-09-10  8679

          timestamp    id   val
0  2019-08-30 10:00  1234  AAAB
1  2019-08-28 10:00  1234  ABBA
2  2019-08-27 10:30  8679  CXXC
3  2019-08-07 12:00  1234  BBAA
4  2019-09-12 10:00  8679  XCXC
5  2019-09-11 14:00  8679  CCXX
6  2019-08-29 18:00  1234  BAAB

The classic way to merge by id, so that each timestamp falls between start and end in df1, is to merge on id (or on a dummy variable) and then filter:

merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
    (merged_df["timestamp"] >= merged_df["start"])
    & (merged_df["timestamp"] <= merged_df["end"])
]

Which gives me the output I want:

           end       start    id         timestamp   val
0   2019-08-31  2019-08-27  1234  2019-08-30 10:00  AAAB
1   2019-08-31  2019-08-27  1234  2019-08-28 10:00  ABBA
3   2019-08-31  2019-08-27  1234  2019-08-29 18:00  BAAB
4   2019-08-28  2019-08-22  8679  2019-08-27 10:30  CXXC
7   2019-09-09  2019-08-04  8679  2019-08-27 10:30  CXXC
19  2019-09-14  2019-09-10  8679  2019-09-12 10:00  XCXC
20  2019-09-14  2019-09-10  8679  2019-09-11 14:00  CCXX

My question: I need to do the same merge and get the same results, but df1 has 200K rows and df2 has 600K.

What I have tried so far:

  • The classic merge-and-filter approach above fails, because the initial merge creates a huge data frame that overloads memory.

  • I also tried the pandasql approach, which ended with my 16GB RAM PC getting stuck.

  • I tried merge_asof in three steps (left join, right join, outer join) as explained here, but my tests show it always returns at most two records from df2 for a single row in df1 (see the sketch after this list).
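
For reference, here is a minimal sketch of one such merge_asof step on the sample data above; merge_asof matches each df2 row to at most one df1 row, so no combination of asof joins can return every range a single record falls into:

# one asof step: match each df2 event to the nearest df1 row (per id)
# whose "end" is at or after the timestamp, then keep in-range rows;
# CXXC can only ever match one of 8679's two overlapping ranges here
asof = pd.merge_asof(
    df2.sort_values("timestamp"),
    df1.sort_values("end"),
    left_on="timestamp",
    right_on="end",
    by="id",
    direction="forward",
)
print(asof[asof["timestamp"] >= asof["start"]])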

Any good advice will be appreciated!

Niv Cohen
  • Since you only care about a specific subset and not the whole dataframe, have you tried filtering df1 and df2 first and then using the left merge/join? I guess it will work if your range is not the whole dataframe; you don't really need to merge everything if you only want a subset. – tchar Aug 15 '21 at 14:47
  • Do the intervals overlap? An interval index should do it; however, with overlapping indices it might be a bit tricky – sammywemmy Aug 15 '21 at 21:59
  • Oh, one other thing: are there duplicates in df2? – sammywemmy Aug 15 '21 at 23:33
  • In df2 there are no duplicates, but several incidents in df2 can fit a single record in df1, as you can see with indexes 0, 1, 3 in the output example. – Niv Cohen Aug 16 '21 at 07:20
  • Why don't you dump it into SQL and run it there? Or use Python's pypolar library, or R's data.table (they support non-equi joins, which are quite efficient as well; see the sketch after these comments). – sammywemmy Aug 16 '21 at 10:59
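
A minimal sketch of that SQL suggestion, using DuckDB as a stand-in engine (DuckDB is an assumption here; the comment only names SQL, pypolar, and data.table, and this assumes the duckdb package is installed):

import duckdb  # assumption: not named in the thread

# DuckDB scans the pandas DataFrames df1/df2 from local scope by name and
# evaluates the range (non-equi) join in its own engine, without
# materializing the full id-on-id product in Python memory
out = duckdb.sql(
    """
    SELECT d1."end", d1."start", d1.id, d2."timestamp", d2.val
    FROM df1 AS d1
    JOIN df2 AS d2
      ON d1.id = d2.id
     AND d2."timestamp" BETWEEN d1."start" AND d1."end"
    """
).df()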

2 Answers


Perhaps you can make a generator function with groupby and find the matching date range with pd.IntervalIndex, so you don't have to merge:

def func():
    for x, y in df2.groupby("id"):
        # index this id's date ranges by interval, then look up each
        # timestamp to recover its matching [start, end] range
        # (note: .loc raises a KeyError for a timestamp that falls in no
        # interval, and overlapping intervals make the lookup ambiguous)
        tmp = df1.loc[df1["id"].eq(x)]
        tmp.index = pd.IntervalIndex.from_arrays(tmp["start"], tmp["end"], closed="both")
        y[["start", "end"]] = tmp.loc[y["timestamp"], ["start", "end"]].to_numpy()
        yield y

print(pd.concat(func()).sort_index())

            timestamp    id   val      start        end
0 2019-08-30 10:00:00  1234  AAAB 2019-08-27 2019-08-31
1 2019-08-28 10:00:00  1234  ABBA 2019-08-27 2019-08-31
2 2019-08-07 10:30:00  8679  CXXC 2019-08-04 2019-09-09
3 2019-08-27 12:00:00  1234  BBAA 2019-08-27 2019-08-31
4 2019-09-12 10:00:00  8679  XCXC 2019-09-10 2019-09-14
5 2019-09-11 14:00:00  8679  CCXX 2019-09-10 2019-09-14
6 2019-08-29 18:00:00  1234  BAAB 2019-08-27 2019-08-31
Henry Yik
  • Thanks, it seems like a good approach, but I can't make it work on my 'real' data. I keep getting a generic KeyError on this line: ```y[["start", "end"]]``` – Niv Cohen Aug 16 '21 at 07:22
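
That KeyError is reproducible in isolation: .loc on an IntervalIndex raises for any timestamp that falls outside every interval of its id. A minimal sketch (illustrative values, not from the thread):

# .loc on an IntervalIndex only succeeds for keys inside some interval
ii = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2019-09-02"]), pd.to_datetime(["2019-09-08"]), closed="both"
)
s = pd.Series(["match"], index=ii)
print(s.loc[pd.Timestamp("2019-09-05")])  # -> "match"
s.loc[pd.Timestamp("2019-09-20")]         # -> KeyError: no interval contains it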

I've been working with niv-dudovitch and david-arenburg on this one, and here are our findings, which I hope will be helpful to some of you out there... The core idea was to prevent growing objects in memory by building a list of dataframes from subsets of the data, one id at a time.

First version, without multiprocessing:

import pandas as pd

unk = df1.id.unique()
j = [None] * len(unk)

df1.set_index("id", inplace=True)
df2.set_index("id", inplace=True)

for k, i in enumerate(unk):
    # join only the rows belonging to a single id, then filter by the date
    # range, so the intermediate frame never grows beyond one id's matches
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how="left")
    j[k] = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]

res = pd.concat(j)
res

Using multiprocessing

In our real case we have two large data frames: df2 has about 3 million rows and df1 slightly above 110K. The output is about 20 million rows.

import itertools
from itertools import repeat
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def get_val_between(ids, df1, df2):
    """
    Locate all df2 values whose timestamp falls between start and end, by id.

    Args:
        - ids (list): chunk of ids (may contain None fill values from grouper)
        - df1 (pd.DataFrame): date ranges, indexed by id
        - df2 (pd.DataFrame): timestamped values, indexed by id

    Returns:
        - concatenated dataframe of matches, or None
    """

    j = [None] * len(ids)
    k = 0

    for i in ids:
        tmp = df1.loc[df1.index.isin([i])].join(
            df2.loc[df2.index.isin([i])], how="left"
        )
        tmp = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]

        # store the filtered result in slot k
        j[k] = tmp
        k += 1

    # keep only the non-None entries in j (safeguard for unfilled slots)
    j = [i for i in j if i is not None]
    if len(j) > 0:
        return pd.concat(j)
    else:
        return None


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""

    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def main():

    df1.reset_index(inplace=True, drop=True)
    df2.reset_index(inplace=True, drop=True)
    id_lst = df1.id.unique()
    iter_ids = grouper(10, list(id_lst))

    df1.set_index("id", inplace=True)
    df2.set_index("id", inplace=True)

    # set up multiprocessing: one task per chunk of ids

    executor = ProcessPoolExecutor(20)
    # executor.map yields the workers' results lazily, in order;
    # pd.concat consumes them as they complete and drops the None returns
    result_futures = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
    result_concat = pd.concat(result_futures)

    print(result_concat)


if __name__ == "__main__":
    main()

The results are as expected:

           end      start           timestamp   val
id                                                  
8679 2019-08-28 2019-08-22 2019-08-27 10:30:00  CXXC
8679 2019-09-09 2019-08-04 2019-08-27 10:30:00  CXXC
8679 2019-09-14 2019-09-10 2019-09-11 14:00:00  CCXX
8679 2019-09-14 2019-09-10 2019-09-12 10:00:00  XCXC
1234 2019-08-31 2019-08-27 2019-08-28 10:00:00  ABBA
1234 2019-08-31 2019-08-27 2019-08-29 18:00:00  BAAB
1234 2019-08-31 2019-08-27 2019-08-30 10:00:00  AAAB

As a benchmark, with an output of 20 million rows, the multiprocessing approach is about 10x faster.

Niv Cohen