Sum of numpy array by indexes

Question

In summary, I am working on a fitting program in Python, primarily using NumPy and SciPy. I will call the result of my fit y, and the target value f.

The issue is that each value of y is the sum of several components. I need an efficient way to compute these sums for large data structures.

I currently have 2 relevant arrays.

y_all - 1D array of the components for all values of y

y_idx - 1D array, the same length as y_all, giving the index of the value of y that each component belongs to

It is worth mentioning that y_all is ordered, and the components are grouped together.

I need a way of calculating y.

Example problem:

y_all = [1,1,1,2,2,2,2,3,3]
y_idx = [0,0,1,1,1,1,2,2,3]

y -> [2,7,5,3]

Methods tried:

Very rudimentary numpy hstack/where comprehension

A SciPy sparse matrix of 1s, with row positions given by y_idx, multiplied by y_all.

A SciPy sparse matrix with y_all as the data, converted to a NumPy array and summed along the rows.

These solutions all give the correct result, but run slowly on large datasets.
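For reference, the first sparse-matrix idea might look roughly like the sketch below. This is only a guess at the setup, not the code actually used in the fitting program; the names `indicator` and `y_sparse` are made up for illustration.

```python
import numpy as np
from scipy import sparse

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])

n_groups = y_idx.max() + 1

# Indicator matrix (hypothetical name): entry (y_idx[k], k) is 1,
# so row i picks out exactly the components that belong to y[i].
indicator = sparse.csr_matrix(
    (np.ones(y_all.size, dtype=y_all.dtype), (y_idx, np.arange(y_all.size))),
    shape=(n_groups, y_all.size),
)

# Matrix-vector product sums the components of each group.
y_sparse = indicator @ y_all
print(y_sparse)
```

If y_idx stays fixed between fit iterations, the matrix could presumably be built once and reused, so only the product would be repeated.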

My current dataset has len(y) = ~4000, len(y_all) = ~100000, with the plan being to approximately double both.

`y = np.zeros(y_idx.max()+1, dtype=y_all.dtype); np.add.at(y, y_idx, y_all)`, assuming *y_all*, *y_idx* are `np.array`, not `list`. Please format your question appropriately and use an example with suitable types. — Michael Szczesny, Mar 18 '23 at 06:52
Please add the code for the approaches you have already tried. You are asking for a faster solution, we can't benchmark against some vague descriptions. — Michael Szczesny, Mar 18 '23 at 07:03
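The `np.add.at` one-liner from the comment above, written out as a runnable sketch on the example data. The `np.bincount` line is an extra alternative that nobody in the thread proposed; it is included only for comparison, and it returns floats when `weights` is given.

```python
import numpy as np

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])

# Unbuffered in-place add: each y_all[k] is accumulated into y[y_idx[k]].
# Works whether or not y_idx is sorted.
y = np.zeros(y_idx.max() + 1, dtype=y_all.dtype)
np.add.at(y, y_idx, y_all)

# Alternative (not from the thread): weighted bincount.
# Requires non-negative integer indices; the result is float64.
y_bincount = np.bincount(y_idx, weights=y_all, minlength=y_idx.max() + 1)

print(y)
print(y_bincount)
```

Which of the two is faster likely depends on the NumPy version and array sizes, so it is worth timing both on the real data.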

score 2 · Answer 1 · answered Mar 18 '23 at 06:37

Use reduceat:

import numpy as np

# set up data
y_all = np.array([1,1,1,2,2,2,2,3,3])
y_idx = np.array([0,0,1,1,1,1,2,2,3])
y_target = np.array([2,7,5,3])

# ensure the groups are sorted
# optional here as the data is already sorted
order = np.argsort(y_idx)
y_all2 = y_all[order]
y_idx2 = y_idx[order]

# compute the grouped sum
y = np.add.reduceat(y_all2, np.r_[0, np.nonzero(np.diff(y_idx2))[0]+1])

Output:

array([2, 7, 5, 3])
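To make the `reduceat` line easier to follow, here is the same computation split into steps on the already-sorted example data. The intermediate names `changes` and `starts` are only for illustration.

```python
import numpy as np

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])  # must be sorted for this approach

# Nonzero wherever the group index changes between neighbours.
changes = np.diff(y_idx)                       # [0 1 0 0 0 1 0 1]

# Start offset of each group: position 0 plus every position after a change.
starts = np.r_[0, np.nonzero(changes)[0] + 1]  # [0 2 6 8]

# Sum each slice y_all[starts[i]:starts[i+1]]; the last slice runs to the end.
y = np.add.reduceat(y_all, starts)
print(y)
```

Note that this produces one sum per index that actually occurs in y_idx. If some index had no components, the result would be shorter than `y_idx.max() + 1` rather than containing a zero for it.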

answered Mar 18 '23 at 06:37

mozway


Is `np.add.at` still too slow in comparison? Would save sorting two arrays and constructing the diff-index array. – Michael Szczesny Mar 18 '23 at 06:55
@MichaelSzczesny no it's a good approach, I just didn't think of it. You should post a real answer ;) – mozway Mar 18 '23 at 08:50

Ayush · Answer 2 · 2023-03-18T05:01:44.210

0

Take a look at this Group by in numpy

You can also use pandas and its `groupby` method.

The y_idx array can be used as the grouping key, followed by a sum.
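A minimal sketch of the pandas route on the example data from the question, assuming pandas is available. Wrapping y_all in a `Series` is just one way to set this up.

```python
import numpy as np
import pandas as pd

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])

# Group the components by their target index and sum each group.
# groupby sorts the keys by default, so the result is ordered by index.
y = pd.Series(y_all).groupby(y_idx).sum().to_numpy()
print(y)
```

Building the Series adds some overhead, so for a pure-NumPy pipeline this may not beat the NumPy-only approaches; it is worth timing on the real data.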

edited Mar 18 '23 at 05:01

answered Mar 18 '23 at 04:58

Ayush


On a lighter note .... ask GPT :) – Ayush Mar 18 '23 at 05:00
Please do not promote GPT here, GPT programming answers are very often incorrect and a waste of time for the community – mozway Mar 18 '23 at 06:43
