0

I have two arrays:

index  = [2,1,0,0,1,1,1,2]
values = [1,2,3,4,5,4,3,2]

I would like to produce:

[sum(v for i,v in zip(index, values) if i == ui) for i in sorted(set(index))]

in the most efficient way possible.

  • my values are computed via autograd
  • doing a groupby in pandas is really not efficient because of the point above
  • I have to do it hundreds of times on the same index but with different values
  • len(values) ~ 10**7
  • len(set(index)) ~ 10**6
  • Counter(index).most_common(1)[0][1] ~ 1000

I think a pure numpy solution would be the best.

I tried to precompute the reduced version of index, and then do:

[values[l].sum() for l in reduced_index]

but it is not efficient enough.

Here is a minimal code sample:

import numpy as np
import autograd.numpy as anp
from autograd import grad
import pandas as pd

EASY = True

if EASY:
    index = np.random.randint(10, size=10**3)
    values = anp.random.rand(10**3) * 2 - 1
else:
    index = np.random.randint(1000, size=10**7)
    values = anp.random.rand(10**7) * 2 - 1


# doesn't work
def f1(values):
    return anp.exp(anp.bincount(index, weights=values)).sum()


index_unique = sorted(set(index))
index_map = {j: i for i, j in enumerate(index_unique)}
index_mapped = [index_map[i] for i in index]
index_lists = [[] for _ in range(len(index_unique))]
for i, j in enumerate(index_mapped):
    index_lists[j].append(i)


def f2(values):
    s = anp.array([values[l].sum() for l in index_lists])
    return anp.exp(s).sum()


ans = grad(f2)(values)
Labo
  • 2,482
  • 2
  • 18
  • 38

1 Answers1

2

If your index are non negative integers, you can use np.bincount with values as weights:

np.bincount(index, weights=values)
# array([ 7., 14.,  3.])

This gives the sum at each position from 0 to max(index).

Psidom
  • 209,562
  • 33
  • 339
  • 356
  • 1
    Thanks, it seems to be really the way to go with standard numpy code. I'll test with autograd! – Labo Jun 19 '18 at 23:43
  • So, it doesn't work and gives the error "ValueError: setting an array element with a sequence.". Basically, with autograd, you are not allowed to do something like `a[i] = v` if `a` is an autograd object. I still upvote your answer because it it very efficient for standard numpy code and I didn't know about it. – Labo Jun 20 '18 at 00:00
  • Not very familiar with `autograd`. But maybe [convert `autograd` to numpy](https://stackoverflow.com/questions/44340848/how-to-convert-pytorch-autograd-variable-to-numpy). – Psidom Jun 20 '18 at 00:14
  • I'm using https://github.com/HIPS/autograd, and no precisely because I want to be able to differenciate. – Labo Jun 20 '18 at 00:18
  • I added an example! – Labo Jun 20 '18 at 00:21
  • You must get the error somewhere else. The line `anp.exp(anp.bincount(index, weights=values)).sum(); 4384.2393731406264` works for me. – Psidom Jun 20 '18 at 00:26
  • 1
    Try with `ans = grad(f1)(values)`. – Labo Jun 20 '18 at 00:28