2

Suppose I have two DataFrames of different sizes:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)),
                        volume=np.random.randint(0, 10, size=20)))

which gives:

df1: 
    xlow   xup
0    0.0   1.0
1    1.0   2.0
2    2.0   3.0
3    3.0   4.0
4    4.0   5.0
5    5.0   6.0
6    6.0   7.0
7    7.0   8.0
8    8.0   9.0
9    9.0  10.0
10  10.0  11.0

and:

df2:
         x  volume
0   1.632789       8
1   8.346898       7
2   1.372285       2
3   1.946896       9
4   7.047305       0
5   3.851938       4
6   2.439664       7
7   8.823509       1
8   1.136700       1
9   8.766352       8
10  2.135441       8
11  8.092385       4
12  6.532898       3
13  7.199914       2
14  1.036684       0
15  9.714326       1
16  5.964111       0
17  9.625200       2
18  9.999818       6
19  9.891857       1

Now I want to add a third column to df1, say total_volume, which for each row is the sum of the volume values in df2 whose x lies between that row's xlow and xup. I can do this using:

df1['total_volume'] = df1.apply(
    lambda row: df2[(df2.x <= row['xup']) & (df2.x > row['xlow'])].volume.sum(),
    axis=1)

which results in:

    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0

We can check the value of, say, the second row (xlow=1, xup=2) with:

df2[(df2.x <= 2) & (df2.x > 1)].volume.sum()  # 20

In reality, my df1 can have up to hundreds of thousands of rows, so this can take tens of minutes to complete. Is there a more vectorized/Pythonic way of doing this? I tried pandas merge and join but was not successful, most likely because I am still a novice.

Thank you for all your help!

tn22288
  • 35
  • 2
  • Is it possible the groups in `df1` are overlapping? – jezrael Oct 19 '20 at 05:10
  • Possible duplicate: [Best way to join / merge by range in pandas](https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas) – รยקคгรђשค Oct 19 '20 at 06:12
  • Thank you for the link. Yes, the answer in that link works too, just like jezrael's answer below. I went even further and tried it on my full dataset, which is about 500,000 rows for df1 and 20,000 rows for df2. The memory usage is quite heavy with np.where in this case, making it much slower than simply looping over the df1 rows (a sketch of that np.where approach follows this list). – tn22288 Oct 20 '20 at 18:00
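
For reference, a minimal sketch of the np.where/broadcasting range join from the linked question, assuming the column names used above. The full len(df1) × len(df2) boolean mask it materialises is what drives the memory usage mentioned in the comment:

import numpy as np

# Broadcast each df2.x against every (xlow, xup] interval of df1.
# The mask has shape (len(df1), len(df2)), which is why memory grows
# quickly for ~500,000 x 20,000 rows.
x = df2['x'].to_numpy()
lo = df1['xlow'].to_numpy()[:, None]   # shape (len(df1), 1), broadcasts over df2
hi = df1['xup'].to_numpy()[:, None]
mask = (x > lo) & (x <= hi)
df1['total_volume'] = np.where(mask, df2['volume'].to_numpy(), 0).sum(axis=1)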

2 Answers

1

If the bins are not overlapping, it is possible to use cut with an aggregate sum and then add the result to df1 by DataFrame.join:

# Assign each x to the bin whose upper edge is xup
df2['g'] = pd.cut(df2['x'], bins=[0] + df1['xup'].tolist(), labels=df1['xup'])

# Sum volume per bin and join back to df1 on the bin's upper edge
df2 = df1.join(df2.groupby('g')['volume'].sum(), on='xup')
print(df2)

    xlow   xup  volume
0    0.0   1.0       0
1    1.0   2.0      20
2    2.0   3.0      15
3    3.0   4.0       4
4    4.0   5.0       0
5    5.0   6.0       0
6    6.0   7.0       3
7    7.0   8.0       2
8    8.0   9.0      20
9    9.0  10.0      10
10  10.0  11.0       0
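
One hedged note on the snippet above: because 'g' is categorical, whether empty bins are kept depends on the observed keyword of groupby; on recent pandas versions you may want to pass it explicitly so bins with no rows (like (0, 1]) still appear with a sum of 0:

# observed=False keeps empty categories, so bins with no matching rows
# get a sum of 0 instead of being dropped from the groupby result.
df2 = df1.join(df2.groupby('g', observed=False)['volume'].sum(), on='xup')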
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you for your quick response. The bins are not overlapping, so it works like a charm. While I might have to open a new thread, I might as well try to ask the question here: is there any way to extend this into three dimensions? What I mean is df1 would have 6 columns: xup, xlow, yup, ylow, zup, zlow, while df2 has 3 columns of x, y, z. The binning condition then becomes [xlow – tn22288 Oct 20 '20 at 17:54
  • @tn22288 I am offline, on phone only, so it is best to post a new question. – jezrael Oct 20 '20 at 18:18
0

Does each row of df1 cover the same range? Then iterate over df2 and simply add to df1:

df1['total_volume'] = 0
for _, d in df2.iterrows():  # iterate rows, not column names
    df1.loc[int(d.x // 1), 'total_volume'] += d.volume

If not, sort df2.
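
If the bins really are unit-width as this answer assumes, the loop can also be replaced with a groupby on the floored x. This is only a sketch using the column names from the question; note that flooring uses [i, i+1) bins, so exact integers land one bin higher than with the (xlow, xup] convention above:

# Bin index = floor(x); assumes df1's row i spans roughly [i, i+1).
idx = (df2['x'] // 1).astype(int)
df1['total_volume'] = (
    df2.groupby(idx)['volume'].sum()
       .reindex(df1.index, fill_value=0)  # empty bins get 0
)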

Canasta
  • 228
  • 1
  • 6