
For example, I would like to compute the weighted sum of columns 'a' and 'c' of the DataFrame below, with weights defined in the dictionary w.

df = pd.DataFrame({'a': [1,2,3], 
                   'b': [10,20,30], 
                   'c': [100,200,300],
                   'd': [1000,2000,3000]})
w = {'a': 1000., 'c': 10.}

I figured out some options myself (see below), but all look a bit complicated. Isn't there a direct pandas operation for this basic use-case? Something like df.wsum(w)?


I tried pd.DataFrame.dot, but it raises a value error:

df.dot(pd.Series(w))
# This raises an exception:
# "ValueError: matrices are not aligned"

The exception can be avoided by specifying a weight for every column, but this is not what I want.

w = {'a': 1000., 'b': 0., 'c': 10., 'd': 0. }
df.dot(pd.Series(w)) # This works

How can one compute the dot product on a subset of columns only? Alternatively, one could select the columns of interest before applying the dot operation, or exploit the fact that pandas/numpy ignores nans when computing (row-wise) sums (see below).

Here are three methods that I was able to come up with myself:

w = {'a': 1000., 'c': 10.}

# 1) Create a complete lookup W.
W = { c: 0. for c in df.columns }
W.update(w)
ret = df.dot(pd.Series(W))

# 2) Select columns of interest before applying the dot product.
ret = df[list(w.keys())].dot(pd.Series(w))

# 3) Exploit the handling of NaNs when computing the (row-wise) sum
ret = (df * pd.Series(w)).sum(axis=1)
# (df * pd.Series(w)) contains columns full of nans

Did I miss an option?

normanius
    Not much better than your third option but... `df.mul(w).sum(axis=1)` – ayhan Dec 30 '18 at 19:05
  • Can you share what you don't like about your current three solutions to your problem? – gosuto Dec 30 '18 at 19:06
  • You could do: `df.loc[:, w].dot(pd.Series(w))` – Dani Mesejo Dec 30 '18 at 19:17
  • @jorijnsmit I wondered whether I missed the best option. Weighted combinations of a set of columns are common, and by experience pandas typically offers simple solutions for such common tasks. I also can imagine that my three suggestions are not equivalent in terms of costs. Lastly, I didn't find a satisfactory answer on SO that addresses my problem, which is why I collected my current understanding and asked for a clarification. Maybe someone else will benefit from this. – normanius Dec 30 '18 at 20:13

4 Answers


You could use a Series as in your first example, just use reindex afterwards:

import pandas as pd

df = pd.DataFrame({'a': [1,2,3],
                   'b': [10,20,30],
                   'c': [100,200,300],
                   'd': [1000,2000,3000]})

w = {'a': 1000., 'c': 10.}
print(df.dot(pd.Series(w).reindex(df.columns, fill_value=0)))

Output

0    2000.0
1    4000.0
2    6000.0
dtype: float64
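To see what the reindex step does on its own: it aligns the sparse weight Series to the full column set, filling the missing labels with zeros. A quick sketch, reusing w and the column names from the question:

```python
import pandas as pd

w = {'a': 1000., 'c': 10.}

# align the weights to all columns; labels missing from w become 0
aligned = pd.Series(w).reindex(['a', 'b', 'c', 'd'], fill_value=0)
print(aligned)  # a: 1000.0, b: 0.0, c: 10.0, d: 0.0
```

This is exactly the zero-filled lookup W from the question, built in one line.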
Dani Mesejo
    According to my benchmark (see [my answer](https://stackoverflow.com/a/66638597/3388962)), your solution is the fastest for unaligned weight vectors `w`. I therefore accepted your answer. – normanius Mar 15 '21 at 13:10

Here's an option without having to create a pd.Series:

(df.loc[:,w.keys()] * list(w.values())).sum(axis=1)
0    2000.0
1    4000.0
2    6000.0
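Note that this relies on dict views having a stable order: since Python 3.7, `w.keys()` and `w.values()` iterate in the same insertion order, so the selected columns line up with the weight list. A quick sanity check with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [10, 20, 30],
                   'c': [100, 200, 300],
                   'd': [1000, 2000, 3000]})
w = {'a': 1000., 'c': 10.}

# w.keys() and list(w.values()) iterate in the same insertion order,
# so each selected column is multiplied by its own weight
res = (df.loc[:, w.keys()] * list(w.values())).sum(axis=1)
print(res.tolist())  # [2000.0, 4000.0, 6000.0]
```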
yatu

I stumbled on my own question again and benchmarked the available answers.

Observation: It is worthwhile to fill the incomplete weight vector with zeros first, rather than taking a view of the relevant columns and dot-multiplying the resulting subframe.


import pandas as pd
import numpy as np

def benchmark(n_rows, n_cols, n_ws):
    print("n_rows:%d, n_cols:%d, n_ws:%d" % (n_rows, n_cols, n_ws))
    df = pd.DataFrame(np.random.randn(n_rows, n_cols), 
                      columns=range(n_cols))
    w = dict(zip(np.random.choice(np.arange(n_cols), n_ws), 
                 np.random.randn(n_ws)))
    w0 = pd.Series(w).reindex(df.columns, fill_value=0).values

    # Method 0 (aligned vector w0, reference!)
    def fun0(df, w0): return df.values.dot(w0)
    # Method 1 (reindex)
    def fun1(df, w): return df.dot(pd.Series(w).reindex(df.columns, fill_value=0))
    # Method 2 (column view)
    def fun2(df, w): return (df.loc[:,w.keys()] * list(w.values())).sum(axis=1)
    # Method 3 (column view, faster)
    def fun3(df, w): return df.loc[:, w].dot(pd.Series(w))
    # Method 4 (column view, numpy)
    def fun4(df, w): return df[list(w.keys())].values.dot(list(w.values()))

    # Assert equivalence
    np.testing.assert_array_almost_equal(fun0(df,w0), fun1(df,w), decimal=10)
    np.testing.assert_array_almost_equal(fun0(df,w0), fun2(df,w), decimal=10)
    np.testing.assert_array_almost_equal(fun0(df,w0), fun3(df,w), decimal=10)
    np.testing.assert_array_almost_equal(fun0(df,w0), fun4(df,w), decimal=10)

    print("fun0:", end=" ")
    %timeit fun0(df, w0)
    print("fun1:", end=" ")
    %timeit fun1(df, w)
    print("fun2:", end=" ")
    %timeit fun2(df, w)
    print("fun3:", end=" ")
    %timeit fun3(df, w)
    print("fun4:", end=" ")
    %timeit fun4(df, w)

benchmark(n_rows = 200000, n_cols = 11, n_ws = 3)
benchmark(n_rows = 200000, n_cols = 11, n_ws = 9)
benchmark(n_rows = 200000, n_cols = 31, n_ws = 5)

The output (fun0() is the reference using the zero-filled vector w0):

n_rows:200000, n_cols:11, n_ws:3
fun1: 1.98 ms ± 86.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun2: 9.66 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun3: 2.68 ms ± 90.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun4: 2.2 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

n_rows:200000, n_cols:11, n_ws:9
fun1: 1.85 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
fun2: 11.7 ms ± 54.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun3: 3.7 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun4: 3.17 ms ± 29.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

n_rows:200000, n_cols:31, n_ws:5
fun1: 3.08 ms ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun2: 13.1 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
fun3: 5.48 ms ± 57 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
fun4: 4.98 ms ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I've tested with pandas 1.2.3, numpy 1.20.1 and Python 3.9.0 on a MacBook Pro (late 2015). Similar results hold for older Python versions.
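The %timeit lines above are IPython magics; outside a notebook, the same measurement can be sketched with the stdlib timeit module. The figures are of course machine-dependent, and the weight dict below is just an arbitrary example:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 11), columns=range(11))
w = {0: 1000., 5: 10., 7: -3.}
w0 = pd.Series(w).reindex(df.columns, fill_value=0).values

# time the reference (fun0) and the reindex variant (fun1) without IPython
t0 = timeit.timeit(lambda: df.values.dot(w0), number=100)
t1 = timeit.timeit(lambda: df.dot(pd.Series(w).reindex(df.columns, fill_value=0)),
                   number=100)
print(f"fun0: {t0:.4f}s, fun1: {t1:.4f}s (100 loops)")
```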

normanius

Using numpy dot with the underlying values:

df[list(w.keys())].values.dot(list(w.values()))
array([2000., 4000., 6000.])

Fixing your error:

df.mul(pd.Series(w), axis=1).sum(axis=1)
0    2000.0
1    4000.0
2    6000.0
dtype: float64
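The index alignment that made df.dot raise a ValueError is exactly what mul exploits here: columns without a matching weight become all-NaN after the multiplication, and sum(axis=1) skips NaNs by default (skipna=True). A quick check with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [10, 20, 30],
                   'c': [100, 200, 300],
                   'd': [1000, 2000, 3000]})
w = {'a': 1000., 'c': 10.}

prod = df.mul(pd.Series(w), axis=1)
# columns 'b' and 'd' have no weight and are all-NaN after the multiplication
print(prod.isna().all())
# sum(axis=1) ignores the NaN columns (skipna=True by default)
print(prod.sum(axis=1).tolist())  # [2000.0, 4000.0, 6000.0]
```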
BENY