0

I have a 2d numpy array with repeated values in first column. The repeated values can have any corresponding value in second column.

Its easy to find the cumsum using numpy, but, I have to find the cumsum for all the repeated values.

How can we do this effectively using numpy or pandas?

Here, I have solved the problem using ineffective for-loop. I was wondering if there is a more elegant solution.

Question How can we get the same result in more effective fashion?

Help will be appreciated.

#!python
# -*- coding: utf-8 -*-#
#
# Imports
import pandas as pd
import numpy as np
np.random.seed(42)  # make results reproducible

aa = np.random.randint(1, 20, size=10).astype(float)
bb = np.arange(10)*0.1

unq = np.unique(aa)

ans = np.zeros(len(unq))
print(aa)
print(bb)
print(unq)

for i, u in enumerate(unq):
    for j, a in enumerate(aa):
        if a == u:
            print(a, u)
            ans[i] += bb[j]

print(ans)


"""
# given data
idx  col0  col1
0    7.    0.0 
1    15.   0.1
2    11.   0.2
3    8.    0.3
4    7.    0.4
5    19.   0.5
6    11.   0.6
7    11.   0.7
8    4.    0.8
9    8.    0.9


# sorted data
4.    0.8
7.    0.0
7.    0.4
8.    0.9
8.    0.3
11.   0.6
11.   0.7
11.   0.2
15.   0.1
19.   0.5

# cumulative sum for repeated serial
4.    0.8
7.    0.0 + 0.4
8.    0.9 + 0.3
11.   0.6 + 0.7 + 0.2
15.   0.1
19.   0.5

# Required answer
4.    0.8 
7.    0.4    
8.    1.2
11.   1.5
15.   0.1
19.   0.5
"""
  • I think you are looking to do a `groupby()`... – pault Jun 07 '18 at 17:27
  • If `aa` is integer and from a not too large interval you can use `np.bincount(aa, bb, aa.max()+1)` I didn't benchmark it but would expect it too typically to scale better than pandas. – Paul Panzer Jun 07 '18 at 18:21

2 Answers2

3

You can groupby col0 and find the .sum() for col1.

df.groupby('col0')['col1'].sum()

Output:

col0
4.0     0.8
7.0     0.4
8.0     1.2
11.0    1.5
15.0    0.1
19.0    0.5
Name: col1, dtype: float64
harvpan
  • 8,571
  • 2
  • 18
  • 36
  • 1
    Thanks a lot, can we also do it in numpy ? –  Jun 07 '18 at 17:30
  • 1
    @astro123, [this](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) question talks about numpy approach. – harvpan Jun 07 '18 at 17:35
2

I think a pandas method such as the one offered by @HarvIpan is best for readability and functionality, but since you asked for a numpy method as well, here is a way to do it in numpy using a list comprehension, which is more succinct than your original loop:

np.array([[i,np.sum(bb[np.where(aa==i)])] for i in np.unique(aa)])

which returns:

array([[  4. ,   0.8],
       [  7. ,   0.4],
       [  8. ,   1.2],
       [ 11. ,   1.5],
       [ 15. ,   0.1],
       [ 19. ,   0.5]])
sacuL
  • 49,704
  • 8
  • 81
  • 106