
I come from R and, honestly, this is the simplest thing to do in one line with R's data.table, and the operation is also quite fast for large data.tables. But I'm really struggling to implement it in Python. None of the use cases previously mentioned fits my application. The major issue at hand is the memory usage of the Python solution, as I will explain below.

The problem: I've got two large DataFrames, df1 and df2 (each around 50M-100M rows), and I need to merge two (or n) columns of df2 onto df1 based on two conditions:

1) df1.id = df2.id (the usual merge case)

2) df2.value_2A <= df1.value_1 <= df2.value_2B

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})

df1
Out[13]: 
   id  value_1
0   1        2
1   1        5
2   1        7
3   2        1
4   2        3
5   3        4

df2
Out[14]: 
   id  value_2A  value_2B
0   1         0         1
1   1         3         5
2   1         7         9
3   1        12        15
4   2         0         1
5   2         2         4
6   2         3         6
7   3         1         3

desired_output
Out[15]: 
   id  value_1  value_2A  value_2B
0   1        2       NaN       NaN
1   1        5       3.0       5.0
2   1        7       7.0       9.0
3   2        1       0.0       1.0
4   2        3       2.0       4.0
5   2        3       3.0       6.0
6   3        4       NaN       NaN

Now, I know this can be done by first merging df1 and df2 with a 'left' join and then filtering the data. But this is a horrendous solution in terms of scaling. I've got 50M x 50M rows with many duplicates of id, so the merge would create an enormous DataFrame which I would then have to filter.

## This is NOT a solution: the memory usage is far too large and
## the many operations make it extremely inefficient and slow at large scale

output = pd.merge(df1, df2, on='id', how='left')  ## output becomes very large in my case
output.loc[~((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])), ['value_2A', 'value_2B']] = np.nan
output = output.loc[~ output['value_2A'].isnull()]
output = pd.merge(df1, output, on=['id', 'value_1'], how='left')

This is so inefficient. I'm merging a large dataset twice to get the desired output and creating massive dataframes while doing so. Yuck!

Think of this as two DataFrames of events that I'm trying to match together, i.e. tagging whether events of df1 have occurred within events of df2. There are multiple events for each id in both df1 and df2, and the events of df2 are NOT mutually exclusive. The condition really needs to be applied at the time of joining, not after. This is done easily in R:

## in R realm ##
require(data.table)
desired_output <- df2[df1, on=.(id, value_2A <= value_1, value_2B >= value_1)] #fast and easy operation

Is there any way to do this in Python?

Ankhnesmerira
  • I am having the same problem. Did you solve it somehow? – hilo Aug 27 '22 at 01:19
  • Yes, I had to create my own index, filter it, and then merge it with df1. The best way I found is to get the unique pairs of each DataFrame, right join unique df1 and unique df2 separately (this creates the smallest possible dataset), filter that join on the condition, and then merge the resultant index back with df1 in a left join of df1 and the index. Hope that makes sense. – Ankhnesmerira Aug 29 '22 at 03:13
  • So, the pseudo-code would look something like this: { a = unique(df1), b = unique(df2), c = merge(a, b, right join), index = filter(c) based on the condition, result = merge(df1, index, left join) } (see the pandas sketch below) – Ankhnesmerira Aug 29 '22 at 03:15
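A rough pandas translation of that pseudo-code, with df1, df2 and the imports as defined in the question (this is only a sketch of the approach the comment describes, not the OP's actual code):

a = df1.drop_duplicates()   # unique rows of df1
b = df2.drop_duplicates()   # unique rows of df2

## join the unique rows on id (the comment describes a right join of unique df1 to unique df2)
c = pd.merge(a, b, on='id', how='right')

## keep only the rows satisfying the range condition; this is the "index"
index = c[(c['value_1'] >= c['value_2A']) & (c['value_1'] <= c['value_2B'])]

## merge the filtered index back onto the full df1, so unmatched rows come back as NaN
result = pd.merge(df1, index, on=['id', 'value_1'], how='left')

The intermediate frame c only contains the unique rows of both inputs, which is what keeps the memory footprint down compared to the naive full left join.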

2 Answers


Yeah. It's an annoying problem. I handled this by splitting the left DataFrame into chunks.

def merge_by_chunks(left, right, condition=None, **kwargs):
    # Merge `left` and `right` one slice of `left` at a time, filtering each merged
    # chunk with `condition` before keeping it, so the full unfiltered merge never
    # has to exist in memory all at once.
    chunk_size = 1000
    merged_chunks = []
    for chunk_start in range(0, len(left), chunk_size):
        print(f"Merged {chunk_start}            ", end="\r")
        merged_chunk = pd.merge(left=left[chunk_start: chunk_start+chunk_size], right=right, **kwargs)
        if condition is not None:
            merged_chunk = merged_chunk[condition(merged_chunk)]
        merged_chunks.append(merged_chunk)
    return pd.concat(merged_chunks)

Then you can provide the condition as a function.

df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})

def condition_func(output):
    return (output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])

# first pass: left-join df2 onto df1 in chunks, keeping only rows that satisfy the range condition
output = merge_by_chunks(df1, df2, condition=condition_func, on='id', how='left')
# second pass: left-join the filtered result back onto df1 so unmatched rows come back as NaN
merge_by_chunks(df1, output, on=['id', 'value_1'], how='left')

It can be pretty slow depending on the size of the DataFrame, but it doesn't run out of memory.

Janaka Chathuranga

Interesting question!

Looks like pandasql might do what you want. Please see: How to do a conditional join in python Pandas?
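For what it's worth, a minimal sketch of what that could look like with pandasql, with df1 and df2 as defined in the question (sqldf runs the query against DataFrames found in the supplied namespace, using an in-memory SQLite database under the hood):

from pandasql import sqldf

query = """
SELECT d1.id, d1.value_1, d2.value_2A, d2.value_2B
FROM df1 AS d1
LEFT JOIN df2 AS d2
  ON d1.id = d2.id
 AND d1.value_1 BETWEEN d2.value_2A AND d2.value_2B
"""

## the LEFT JOIN with a compound ON clause applies the range condition at join time
desired_output = sqldf(query, locals())

Since pandasql copies the DataFrames into SQLite before querying, the memory and speed trade-off would need to be tested at the 50M-100M row scale.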

Jacek Tomaka