import numpy as np
import pandas as pd

np.random.seed([3, 14])
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df

          A         B         C
0 -0.602923 -0.402655  0.302329
1 -0.524349  0.543843  0.013135
2 -0.326498  1.385076 -0.132454
3 -0.407863  1.302895 -0.604236
4 -0.243362 -0.211261 -2.056621

What is the fastest way to compute df.A * 1 + df.B * 2 + df.C * 3?

Essentially, I want, for this dataframe:

0   -0.501247
1    0.602741
2    2.046290
3    0.385219
4   -6.835748

The answer cannot be df.A * 1 + df.B * 2 + df.C * 3, since the number of columns must not be hardcoded. So I'd want to compute df.iloc[:, 0] * 1 + df.iloc[:, 1] * 2 + ... somehow.
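Spelled out, the naive generalization would look something like this (just a slow reference sketch of what I mean, weighting column i by i + 1):

# Naive spelled-out version: weight column i by (i + 1)
result = sum(df.iloc[:, i] * (i + 1) for i in range(df.shape[1]))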

I'd be interested in any numba solutions out there too!

cs95

2 Answers


I tried to improve the solution by removing the reshape and changing the arange call:

a = df.dot(np.arange(1, len(df.columns)+1))
print(a)
0   -0.501247
1    0.602741
2    2.046290
3    0.385219
4   -6.835748
dtype: float64

The same in NumPy:

a = pd.Series(np.dot(df.values, np.arange(1, len(df.columns)+1)), index=df.index)
print(a)
0   -0.501247
1    0.602741
2    2.046290
3    0.385219
4   -6.835748
dtype: float64
jezrael
  • First solution: `36.3 ms`, second solution: `36.5 ms` on large data. I am surprised that the Series constructor is so cheap. – cs95 Sep 08 '17 at 09:02
  • Maybe it is necessary to test on large data - 100k * 100k – jezrael Sep 08 '17 at 09:03
  • 100k is a lot of columns... there's a possibility of overflow during the calculation. – cs95 Sep 08 '17 at 09:05
  • It was only an idea, no problem ;) – jezrael Sep 08 '17 at 09:06
  • Yeah... I can understand... I originally planned to test on 100k * 100k, but when generating the data my computer became unresponsive... maybe you can test it? But I'm sure it will overflow. – cs95 Sep 08 '17 at 09:07
  • I use a work PC with Windows and 4 GB RAM ;), but I can check the largest df possible ;) – jezrael Sep 08 '17 at 09:08
  • I have an 8 GB Mac, but it's over 4 years old... let me know how it goes! – cs95 Sep 08 '17 at 09:09
  • I think my tests are totally useless - for `[1000000 rows x 1000 columns]` I get `In [2]: %timeit df.dot(np.arange(1, len(df.columns)+1)) 1 loop, best of 3: 21 s per loop`, `In [5]: %timeit df.dot((np.arange(df.shape[1]) + 1).reshape(-1, 1)) 1 loop, best of 3: 27 s per loop` – jezrael Sep 08 '17 at 10:04
  • Wow... that's a crazy difference, considering the only change is the +1. :D – cs95 Sep 08 '17 at 10:08
  • 1
    Thank you for your time and efforts. You have my upvote :-) – cs95 Sep 08 '17 at 10:10

Option 1

The fastest, to my knowledge, would be to use df.dot:

df.dot((np.arange(df.shape[1]) + 1).reshape(-1, 1))

          0
0 -0.501247
1  0.602741
2  2.046290
3  0.385219
4 -6.835748

Option 2

Element-wise product, then a sum along the first axis:

(df * (np.arange(df.shape[1]) + 1)).sum(1)

0   -0.501246
1    0.602742
2    2.046292
3    0.385219
4   -6.835747

Performance

Small (5 x 3)

10000 loops, best of 3: 131 µs per loop  # dot
1000 loops, best of 3: 531 µs per loop   # element-wise prod + sum

Large (100000 x 1000)

10 loops, best of 3: 36.4 ms per loop   # dot
1 loop, best of 3: 1.18 s per loop      # element-wise prod + sum
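
For reference, a sketch of the kind of setup behind these numbers (run in IPython; the shape matches the "large" label above):

import numpy as np
import pandas as pd

# Hypothetical benchmark data matching the "large" case above
df = pd.DataFrame(np.random.randn(100000, 1000))

%timeit df.dot((np.arange(df.shape[1]) + 1).reshape(-1, 1))  # dot
%timeit (df * (np.arange(df.shape[1]) + 1)).sum(1)           # element-wise prod + sum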

For information on the magic behind the implementation of pandas'/NumPy's dot function, see Why is matrix multiplication faster with numpy than with ctypes in Python?.
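Finally, since the question mentions numba: a minimal sketch of a jitted kernel (assuming numba is installed; I make no performance claims for it here):

import numba
import numpy as np
import pandas as pd

@numba.njit
def weighted_colsum(values, weights):
    # out[i] = sum over j of values[i, j] * weights[j]
    n, m = values.shape
    out = np.empty(n)
    for i in range(n):
        acc = 0.0
        for j in range(m):
            acc += values[i, j] * weights[j]
        out[i] = acc
    return out

# Weights 1..m as floats, wrapped back into a Series
pd.Series(weighted_colsum(df.values, np.arange(1.0, df.shape[1] + 1)), index=df.index)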

cs95