Suppose I have two DataFrames of different sizes:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)),
                        volume=np.random.randint(0, 10, size=20)))
which gives:
df1:
    xlow   xup
0    0.0   1.0
1    1.0   2.0
2    2.0   3.0
3    3.0   4.0
4    4.0   5.0
5    5.0   6.0
6    6.0   7.0
7    7.0   8.0
8    8.0   9.0
9    9.0  10.0
10  10.0  11.0
and:
df2:
           x  volume
0   1.632789       8
1   8.346898       7
2   1.372285       2
3   1.946896       9
4   7.047305       0
5   3.851938       4
6   2.439664       7
7   8.823509       1
8   1.136700       1
9   8.766352       8
10  2.135441       8
11  8.092385       4
12  6.532898       3
13  7.199914       2
14  1.036684       0
15  9.714326       1
16  5.964111       0
17  9.625200       2
18  9.999818       6
19  9.891857       1
Now I want to add a third column to df1, say total_volume, containing the sum of the volume values in df2 whose x lies between the xlow and xup of each row of df1. I can do this with:
df1['total_volume'] = df1.apply(
    lambda row: df2[(df2.x <= row['xup']) & (df2.x > row['xlow'])].volume.sum(),
    axis=1)
which results in:
    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0
We can check the value of, say, the second row:
df2[(df2.x <= 2) & (df2.x > 1)].volume.sum()  # 20
In reality, my df1 can have hundreds of thousands of rows, so this can take tens of minutes to complete. Is there a more vectorized/Pythonic way of doing this? I tried pandas merge and join but was not successful, most likely because I am still a novice.
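For context, here is the kind of binning approach I was trying to reach but did not get working on my own. It is only a sketch, and it assumes the (xlow, xup] intervals in df1 are sorted and contiguous, as in the example above: pd.cut assigns every x to its interval in a single pass, and a groupby sums the volumes per interval.

```python
import numpy as np
import pandas as pd

# Same construction as above (no seed, so values differ run to run).
df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)),
                        volume=np.random.randint(0, 10, size=20)))

# Bin edges are the xlow values plus the final xup; right=True makes each
# interval left-open/right-closed, matching (x > xlow) & (x <= xup).
edges = np.append(df1['xlow'].to_numpy(), df1['xup'].iloc[-1])

# labels=False returns the integer bin index of each x (0..len(df1)-1).
bin_idx = pd.cut(df2['x'], bins=edges, right=True, labels=False)

# Sum volume per bin and align back onto df1's index; intervals that
# received no x get filled with 0.
df1['total_volume'] = (df2['volume'].groupby(bin_idx).sum()
                       .reindex(df1.index, fill_value=0)
                       .to_numpy())
```

The reindex with fill_value=0 is what reproduces the zeros for empty intervals that the apply version produces.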
Thank you for all your help!