
I am working on a fitting program in Python, primarily using NumPy and SciPy. I will call the result of my fit y and the target value f.

The issue is that for each value of y, several components must be summed. I need to do this efficiently for large data structures.

I currently have 2 relevant arrays.

y_all - 1D array of the components for all values of y

y_idx - 1D array, the same length as y_all, giving for each component the index of the value of y it belongs to

It is worth mentioning that y_all is ordered, and the components are grouped together.

I need a way of calculating y.

Example problem:

y_all = [1,1,1,2,2,2,2,3,3]
y_idx = [0,0,1,1,1,1,2,2,3]
y -> [2,7,5,3]

Methods tried:

A very rudimentary NumPy hstack/where comprehension.

A SciPy sparse matrix of 1s, with row indices given by y_idx, multiplied by y_all.

A SciPy sparse matrix with y_all as the data, converted to a NumPy array, with the rows then summed.

These solutions all give the correct result, but run slowly on large datasets.

My current dataset has len(y) = ~4000, len(y_all) = ~100000, with the plan being to approximately double both.

Michael Szczesny
  • `y = np.zeros(y_idx.max()+1, dtype=y_all.dtype); np.add.at(y, y_idx, y_all)`, assuming *y_all*, *y_idx* are `np.array`, not `list`. Please format your question appropriately and use an example with suitable types. – Michael Szczesny Mar 18 '23 at 06:52
  • Please add the code for the approaches you have already tried. You are asking for a faster solution, we can't benchmark against some vague descriptions. – Michael Szczesny Mar 18 '23 at 07:03
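The one-liner in the first comment can be expanded into a small runnable sketch. It assumes `y_all` and `y_idx` are NumPy arrays, as the comment says. The `np.bincount` line is an additional alternative that is not part of the comment; it treats `y_all` as weights and therefore returns floats.

```python
import numpy as np

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])

# Unbuffered in-place add: each y_all[i] is accumulated into y[y_idx[i]]
y = np.zeros(y_idx.max() + 1, dtype=y_all.dtype)
np.add.at(y, y_idx, y_all)

# Alternative (not from the comment): weighted bin count, returns float64
y_bincount = np.bincount(y_idx, weights=y_all)

print(y)           # [2 7 5 3]
print(y_bincount)  # [2. 7. 5. 3.]
```

Neither call requires `y_idx` to be sorted. Which one is faster on the sizes mentioned in the question would need to be measured.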

2 Answers


Use reduceat:

import numpy as np

# set up data
y_all = np.array([1,1,1,2,2,2,2,3,3])
y_idx = np.array([0,0,1,1,1,1,2,2,3])
y_target = np.array([2,7,5,3])

# ensure the groups are sorted
# optional here as the data is already sorted
order = np.argsort(y_idx)
y_all2 = y_all[order]
y_idx2 = y_idx[order]

# compute the grouped sum
y = np.add.reduceat(y_all2, np.r_[0, np.nonzero(np.diff(y_idx2))[0]+1])

Output:

array([2, 7, 5, 3])
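One caveat worth checking on real data: the segment boundaries come from `np.diff`, so the result has one entry per index value that actually occurs. If some index between 0 and `y_idx.max()` has no components, the output is shorter than `y_idx.max() + 1`. A minimal sketch of scattering the sums back to their positions, using a made-up `y_idx` with a gap (the data and the name `starts` are illustrative only):

```python
import numpy as np

# Hypothetical data: index 1 has no components
y_all = np.array([1, 1, 2, 3])
y_idx = np.array([0, 0, 2, 2])   # already sorted

# Start position of each run of equal indices
starts = np.r_[0, np.nonzero(np.diff(y_idx))[0] + 1]
sums = np.add.reduceat(y_all, starts)   # one sum per index present

# Place each sum at its index; missing indices stay 0
y = np.zeros(y_idx.max() + 1, dtype=y_all.dtype)
y[y_idx[starts]] = sums

print(sums)  # [2 5]
print(y)     # [2 0 5]
```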
mozway

Take a look at the question "Group by in numpy".

You can also use pandas and its groupby method.

The y_idx array can be used as the grouping key, followed by a sum.
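A minimal sketch of the pandas route, assuming the arrays from the question's example. Wrapping `y_all` in a `Series` is one possible way to do it and is not spelled out in the answer.

```python
import numpy as np
import pandas as pd

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])

# Group the components by their target index and sum each group
y = pd.Series(y_all).groupby(y_idx).sum().to_numpy()

print(y)  # [2 7 5 3]
```

The result contains one entry per distinct value in `y_idx`, ordered by that value.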

Ayush