I have a data set of (xx, yy, EXTRA) values and I want to divide it into grid cells of equal size. For example, let's suppose the data is:

import numpy as np

xx = np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy = np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA = np.array([0.01, 0.003, 2.002, 4.004, 0.5, 0.2])

I want to make square grid cells of size 1x1 and then obtain the sum of "EXTRA" over the points that fall in each cell.

This is what I tried:

import math

for i in range(0, 5):
    for j in range(0, 5):
        for x, y in zip(xx, yy):
            k = math.floor(x)
            kk = math.floor(y)
            if i <= k < i + 1.0 and j <= kk < j + 1.0:
                print("(x,y)=", x, ",", y, ",", "(i,j)=", i, ",", j, "Unknown sum of EXTRA")

I obtain this output:

(x,y)= 0.1 , 0.35 , (i,j)= 0 , 0 Unknown sum of EXTRA
(x,y)= 0.2 , 0.15 , (i,j)= 0 , 0 Unknown sum of EXTRA
(x,y)= 0.1 , 3.0 , (i,j)= 0 , 3 Unknown sum of EXTRA
(x,y)= 3.0 , 1.5 , (i,j)= 3 , 1 Unknown sum of EXTRA
(x,y)= 3.0 , 3.5 , (i,j)= 3 , 3 Unknown sum of EXTRA
(x,y)= 4.1 , 4.5 , (i,j)= 4 , 4 Unknown sum of EXTRA

So the first two points, (0.1, 0.35) and (0.2, 0.15), fall inside the quadrant (0,0). Looking at "EXTRA", I know that the sum for quadrant (0,0) should be Sum_extra = 0.01 + 0.003. However, I can't figure out how to compute that sum in code.

More information

My real problem is that I have "particles" inside a big cubic box. I want to subdivide the box into smaller boxes and, in each of the smaller boxes, obtain the sum of the particles' "mass"; in my example, EXTRA is the mass.

I suspect that the way I classify whether a particle belongs to a quadrant is slow, which would be a problem since I have a lot of data. Any suggestions will be appreciated.

Cruz
  • `I suspect that ... is slow` - did you do any testing to validate this? – wwii Oct 17 '20 at 17:42
  • Not yet, I'm working with smaller samples before I do the full work. However, I think I found a simpler way to do what I want; I will post the solution as a comment if it works. – Cruz Oct 17 '20 at 17:47
  • You can also take a look at [my research](https://stackoverflow.com/questions/59239886/what-is-the-fastest-way-to-map-group-names-of-numpy-array-to-indices) of the fastest solution in 3D. – mathfux Oct 17 '20 at 19:03
  • `pandas` appears to win here but you can achieve 2x - 3x speed-ups if you use dimensionality reduction. – mathfux Oct 17 '20 at 19:06

1 Answer

Combine the three arrays with zip, using the integer parts of xx and yy, and sort the result on those two values. Then group by the same two values and sum the EXTRA values in each group.

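import itertools
import operator

important = operator.itemgetter(0, 1)  # key: the integer (xx, yy) pair
xtra = operator.itemgetter(-1)         # value: the EXTRA entry
# astype(int) truncates each coordinate to its integer part
data = sorted(zip(xx.astype(int), yy.astype(int), EXTRA), key=important)
gb = itertools.groupby(data, important)  # groupby needs sorted input
for key, group in gb:
    values = list(map(xtra, group))
    print(key, values, sum(values))
    # or just
    # print(key, sum(map(xtra, group)))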
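
If the sort is not needed, the same grouping can probably be done in a single pass with a dictionary keyed by the cell. This is only a sketch, not part of the approach above: the name `cell_sums` is made up for illustration, and `math.floor` is used so that negative coordinates would also land in the expected cell (`astype(int)` truncates toward zero instead).

```python
import math
from collections import defaultdict

import numpy as np

# Sample data from the question
xx = np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy = np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA = np.array([0.01, 0.003, 2.002, 4.004, 0.5, 0.2])

# cell_sums is a hypothetical name: maps (i, j) cell -> running sum of EXTRA
cell_sums = defaultdict(float)
for x, y, extra in zip(xx, yy, EXTRA):
    cell = (math.floor(x), math.floor(y))  # 1x1 cell containing the point
    cell_sums[cell] += extra

# Print only the occupied cells, in sorted order
for cell in sorted(cell_sums):
    print(cell, cell_sums[cell])
```

Only occupied cells appear in the dictionary, so empty cells cost nothing.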

Same concept using a Pandas DataFrame.

import pandas as pd
xx, yy = xx.astype(int),yy.astype(int)

In [25]: df = pd.DataFrame({'xx':xx,'yy':yy,'EXTRA':EXTRA})

In [26]: df.groupby(['xx','yy'])['EXTRA'].sum()
Out[26]: 
xx  yy
0   0     0.013
    3     0.200
3   1     2.002
    3     0.500
4   4     4.004
Name: EXTRA, dtype: float64
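
For a large number of particles, a pure NumPy version may be worth trying. The sketch below flattens each (i, j) cell index into a single integer and lets `np.bincount` sum the weights. It assumes all coordinates are non-negative; `cell_size`, `ix`, `iy`, `flat` and `sums` are names chosen here for illustration. I have not timed it against the pandas version.

```python
import numpy as np

# Sample data from the question
xx = np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy = np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA = np.array([0.01, 0.003, 2.002, 4.004, 0.5, 0.2])

cell_size = 1.0  # assumed edge length of each square cell

# Integer cell index of every point along each axis (assumes coordinates >= 0)
ix = np.floor(xx / cell_size).astype(int)
iy = np.floor(yy / cell_size).astype(int)

# Grid shape just large enough to hold every occupied cell
shape = (int(ix.max()) + 1, int(iy.max()) + 1)

# Collapse each (ix, iy) pair into one flat index, then sum EXTRA per index
flat = np.ravel_multi_index((ix, iy), shape)
sums = np.bincount(flat, weights=EXTRA, minlength=shape[0] * shape[1])
sums = sums.reshape(shape)  # sums[i, j] is the total EXTRA in cell (i, j)

print(sums)
```

For the cubic box, a third index array computed the same way from the z coordinates could be added to the tuple passed to `np.ravel_multi_index`, with `shape` extended to three entries. Note that the result is a dense array, so empty cells take memory too.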
wwii