In summary, I am working on a fitting program in Python, primarily using NumPy and SciPy. I will call the result of my fit y, and the target value f.
The issue is that each value of y is the sum of several components, and I need an efficient way to compute these sums for large data structures.
I currently have 2 relevant arrays.
y_all - 1D array of the components for all values of y
y_idx - 1D array of indices, mapping each entry of y_all to its value of y
It is worth mentioning that y_all is ordered, and the components are grouped together.
I need a way of calculating y.
Example problem:
y_all = [1,1,1,2,2,2,2,3,3]
y_idx = [0,0,1,1,1,1,2,2,3]
y -> [2,7,5,3]
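For concreteness, here is a naive reference implementation of what y should be for this example (the loop itself is just for illustration, not one of the attempted methods):

```python
import numpy as np

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])

# Sum the components belonging to each output index:
# index 0 -> 1+1, index 1 -> 1+2+2+2, index 2 -> 2+3, index 3 -> 3
y = np.array([y_all[y_idx == i].sum() for i in range(y_idx.max() + 1)])
# y -> [2, 7, 5, 3]
```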
Methods tried:
A rudimentary list comprehension using numpy hstack/where.
A scipy sparse matrix of 1s, with one row per value of y (rows given by y_idx), multiplied against y_all.
A scipy sparse matrix built with y_all as the data, converted to a numpy array and summed over its rows.
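A sketch of the second approach above, so the structure is clear (my actual code differs, but the idea is the same):

```python
import numpy as np
from scipy import sparse

y_all = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3])
y_idx = np.array([0, 0, 1, 1, 1, 1, 2, 2, 3])
n = y_idx.max() + 1

# Indicator matrix of 1s: row i has a 1 in each column whose
# component belongs to y[i], so M @ y_all sums the groups.
M = sparse.csr_matrix(
    (np.ones_like(y_all), (y_idx, np.arange(len(y_all)))),
    shape=(n, len(y_all)),
)
y = M @ y_all  # -> array([2, 7, 5, 3])
```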
These solutions all produce correct results, but run slowly on large datasets.
My current dataset has len(y) = ~4000 and len(y_all) = ~100000, and I plan to roughly double both.