I come from R, where this is the simplest thing to do in one line with data.table, and the operation is also quite fast for large tables. But I'm really struggling to implement it in Python. None of the previously mentioned use cases were suitable for my application. The major issue is the memory usage of the Python solution, as I will explain below.
The problem: I've got two large DataFrames, df1 and df2 (each around 50M-100M rows), and I need to merge two (or n) columns of df2 onto df1 based on two conditions:
1) df1.id == df2.id (the usual merge condition)
2) df2.value_2A <= df1.value_1 <= df2.value_2B
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
df1
Out[13]:
id value_1
0 1 2
1 1 5
2 1 7
3 2 1
4 2 3
5 3 4
df2
Out[14]:
id value_2A value_2B
0 1 0 1
1 1 3 5
2 1 7 9
3 1 12 15
4 2 0 1
5 2 2 4
6 2 3 6
7 3 1 3
desired_output
Out[15]:
id value_1 value_2A value_2B
0 1 2 NaN NaN
1 1 5 3.0 5.0
2 1 7 7.0 9.0
3 2 1 0.0 1.0
4 2 3 2.0 4.0
5 2 3 3.0 6.0
6 3 4 NaN NaN
Now, I know this can be done by first merging df1 and df2 with how='left' and then filtering, but that scales horrendously. With ~50M rows on each side and many duplicate ids, the merge would create an enormous intermediate DataFrame that I would then have to filter.
## This is NOT a solution: the memory usage is just too large and
## there are too many operations, making it extremely inefficient and slow at scale
output = pd.merge(df1, df2, on='id', how='left')  ## output becomes very large in my case
## null out the interval columns wherever value_1 falls outside [value_2A, value_2B]
output.loc[~((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])), ['value_2A', 'value_2B']] = np.nan
## drop the non-matching rows, then merge back onto df1 to restore unmatched rows as NaN
output = output.loc[output['value_2A'].notnull()]
output = pd.merge(df1, output, on=['id', 'value_1'], how='left')
This is so inefficient. I'm merging a large dataset twice to get the desired output and creating massive dataframes while doing so. Yuck!
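The only mitigation I can think of is chunking df1 so the intermediate merge stays bounded; a minimal sketch follows (the chunk size is an arbitrary placeholder to tune to available memory). It is still the same merge-then-filter idea, just with a capped peak, so it does not fix the fundamental inefficiency:

## Same merge-then-filter idea, chunked over df1 so the intermediate
## merge never materializes at full size. chunk_size is a placeholder.
chunk_size = 1_000_000
parts = []
for start in range(0, len(df1), chunk_size):
    part = df1.iloc[start:start + chunk_size]
    m = pd.merge(part, df2, on='id', how='left')
    inside = (m['value_1'] >= m['value_2A']) & (m['value_1'] <= m['value_2B'])
    m = m.loc[inside]  ## keep only rows where value_1 falls inside the interval
    ## merge back onto the chunk so unmatched rows reappear with NaN
    parts.append(pd.merge(part, m, on=['id', 'value_1'], how='left'))
output = pd.concat(parts, ignore_index=True)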
Think of this as two DataFrames of events which I'm trying to match: I want to tag whether each event in df1 occurred within an event of df2. There are multiple events per id in both df1 and df2, and the events of df2 are NOT mutually exclusive, so one row of df1 can match several rows of df2. The conditional join really needs to happen at the time of joining, not after. This is easy in R:
## in R realm ##
require(data.table)
desired_output <- df2[df1, on=.(id, value_2A <= value_1, value_2B >= value_1)] #fast and easy operation
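For what it's worth, here is a pure-Python sketch of the semantics I'm after (the function name naive_interval_join is just mine, for illustration). It reproduces desired_output on the toy data, but the per-row loop makes it hopeless at 50M+ rows, so it only illustrates what the join should do:

## Pure-Python sketch of the join semantics (illustrative only, far too slow at scale)
def naive_interval_join(df1, df2):
    lookup = {k: g for k, g in df2.groupby('id')}  ## pre-split df2 by id
    rows = []
    for _, r in df1.iterrows():
        g = lookup.get(r['id'])  ## df2 rows sharing this id, if any
        if g is None:
            hit = None
        else:
            hit = g[(g['value_2A'] <= r['value_1']) & (r['value_1'] <= g['value_2B'])]
        if hit is None or hit.empty:
            rows.append((r['id'], r['value_1'], np.nan, np.nan))  ## no interval contains value_1
        else:
            for _, h in hit.iterrows():  ## one output row per containing interval
                rows.append((r['id'], r['value_1'], h['value_2A'], h['value_2B']))
    return pd.DataFrame(rows, columns=['id', 'value_1', 'value_2A', 'value_2B'])

naive_interval_join(df1, df2)  ## matches desired_output above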
Is there any way to do this efficiently in Python?