5

What is the numpy or pandas equivalent of the R function sweep()?

To elaborate: in R, say we have a coefficient vector beta (numeric) and an array data (20x5, numeric). I want to superimpose the vector on each row of the array, multiply the corresponding elements, and return the resulting 20x5 array. In R I can achieve this using sweep().

Equivalent sample R code (using a smaller 5x4 array and a 4-element beta):

beta <-  c(10, 20, 30, 40)
data <- array(1:20,c(5,4))
sweep(data,MARGIN=2,beta,`*`)
#---------------
 > data
      [,1] [,2] [,3] [,4]
 [1,]    1    6   11   16
 [2,]    2    7   12   17
 [3,]    3    8   13   18
 [4,]    4    9   14   19
 [5,]    5   10   15   20

 > beta
 [1] 10 20 30 40

 > sweep(data,MARGIN=2,beta,`*`)
      [,1] [,2] [,3] [,4]
 [1,]   10  120  330  640
 [2,]   20  140  360  680
 [3,]   30  160  390  720
 [4,]   40  180  420  760
 [5,]   50  200  450  800

I have heard exciting things about numpy and pandas in Python, and they seem to have a lot of R-like commands. What would be the fastest way to achieve the same result using these libraries? The actual data has millions of rows and around 50 columns. The beta vector is of course conformable with data.

sriramn
  • Since some knowledgeable pandas users might not have R installed, this question could be greatly improved by showing the inputs to and output from `sweep` – Paul H Apr 16 '14 at 19:05
  • What is this MARGIN? docs are unclear on what difference is between just sweeping (i.e. `beta * data`). – Andy Hayden Apr 16 '14 at 19:07
  • MARGIN indicates whether to work on columns or rows: MARGIN=2 means columns and 1 means rows – infominer Apr 16 '14 at 19:10
  • It's a lot easier to see what is happening if the numbers are not random. Hence my edit. – IRTFM Apr 16 '14 at 19:21
  • Possible Duplicate? http://stackoverflow.com/questions/3643555/multiply-rows-of-matrix-by-vector – infominer Apr 16 '14 at 19:21
  • It would require a bit of fiddling but `vstack` and e.g. `for i in range(1,6): out = i*array([10, 20, 30, 40])` should do the trick. – Aleksander Lidtke Apr 16 '14 at 19:23
  • @infominer Are you sure that's the case? the r docs suggest it's more subtle... if that is the case the docs here are *terrible*. – Andy Hayden Apr 16 '14 at 19:29
  • @AndyHayden. Yes I am sure. `sweep` is based on `apply`. if you read the documentation for apply, the definition is amply clear IMHO. Before you ask, yes apply is mentioned on the docs page for sweep (look under see also). – infominer Apr 16 '14 at 19:34

3 Answers

6

Pandas has an apply() method too, apply being what R's sweep() uses under the hood. (Note that the MARGIN argument is "equivalent" to the axis argument in many pandas functions, except that it takes values 0 and 1 rather than 1 and 2).

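import numpy as np
import pandas as pd

np.random.seed(1)  # seed() must be called; assigning to it replaces the function
beta = pd.Series(np.random.randn(5))
data = pd.DataFrame(np.random.randn(20, 5))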

You can use an apply with a function which is called on each row:

data.apply(lambda row: row * beta, axis=1)

Note that axis=0 would apply the function to each column. This is the default, as data is stored column-wise and so column-wise operations are more efficient.

However, in this case it's easy to vectorize, which is significantly faster (and more readable): simply multiply row-wise with mul. Both calls give the same result:

In [21]: data.apply(lambda row: row * beta, axis=1).head()
Out[21]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001

In [22]: data.mul(beta, axis=1).head()  # just show first few rows with head
Out[22]:
          0         1         2         3         4
0 -0.024827 -1.465294 -0.416155 -0.369182 -0.649587
1  0.026433  0.355915 -0.672302  0.225446 -0.520374
2  0.042254 -1.223200 -0.545957  0.103864 -0.372855
3  0.086367  0.218539 -1.033671  0.218388 -0.598549
4  0.203071 -3.402876  0.192504 -0.147548 -0.726001

Note: mul is slightly more robust and allows more control than using *, since it takes an explicit axis argument.
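To make the axis argument concrete, here is a minimal sketch that rebuilds the 5x4 sample from the question as a DataFrame. The names by_column, row_scale and by_row are made up for illustration, and order="F" is only there to reproduce R's column-by-column fill.

```python
import numpy as np
import pandas as pd

# The 5x4 sample from the question, filled column by column as R does.
data = pd.DataFrame(np.arange(1, 21).reshape((5, 4), order="F"))
beta = pd.Series([10, 20, 30, 40])

# axis=1 matches beta's index against the column labels, so column j
# is scaled by beta[j], like sweep(data, MARGIN=2, beta, `*`) in R.
by_column = data.mul(beta, axis=1)

# axis=0 matches against the row labels instead (like MARGIN=1);
# the Series then needs one entry per row. row_scale is illustrative.
row_scale = pd.Series([1, 2, 3, 4, 5])
by_row = data.mul(row_scale, axis=0)

print(by_column)
```

Because the match is done on labels rather than positions, a Series whose index does not line up with the frame's labels would produce NaN in the unmatched positions.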

You can do the same in numpy (i.e. on data.values here), either by multiplying the arrays directly, which will be faster as it doesn't worry about data alignment, or by using vectorize rather than apply.
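A minimal sketch of the direct numpy multiplication, again using the 5x4 sample from the question rather than the random data above. The variable name result is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample as in the question; order="F" mimics R's column-major fill.
data = pd.DataFrame(np.arange(1, 21).reshape((5, 4), order="F"))
beta = pd.Series([10, 20, 30, 40])

# .values drops the index, so only shapes matter from here on:
# (5, 4) * (4,) multiplies every row elementwise by beta.
result = data.values * beta.values

print(result)
```

For the sample data, the first row comes out as 10, 120, 330, 640, the same as the sweep() output in the question.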

Andy Hayden
  • Great answer. I am playing around with this approach now and had a question. Can you please comment on the use of `lambda` in `apply`? Any reason to prefer it over a function declared using `def`? Thanks much. – sriramn Apr 18 '14 at 17:12
  • It's just an anonymous function, there's no reason to prefer it (other than pleasing/concise syntax)! – Andy Hayden Apr 18 '14 at 17:51
4

In numpy the concept is called "broadcasting". Example:

import numpy as np
x = np.random.random((4, 3))
x * np.array(range(4))[:, np.newaxis] # sweep along the rows
x + np.array(range(3))[np.newaxis, :] # sweep along the columns
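If you want something that reads like the R call, broadcasting can be wrapped in a small helper. The sketch below is not a numpy function: sweep is a hypothetical name, it only handles 2-D arrays, and it keeps R's convention that MARGIN=1 means rows and MARGIN=2 means columns, with subtraction as the default operation.

```python
import numpy as np

def sweep(x, margin, stats, func=np.subtract):
    """Hypothetical helper mimicking R's sweep() for 2-D arrays."""
    x = np.asarray(x)
    stats = np.asarray(stats)
    if margin == 1:
        # One statistic per row: shape (n_rows, 1) broadcasts across columns.
        return func(x, stats[:, np.newaxis])
    if margin == 2:
        # One statistic per column: shape (1, n_cols) broadcasts down rows.
        return func(x, stats[np.newaxis, :])
    raise ValueError("margin must be 1 (rows) or 2 (columns)")

# The 5x4 sample from the question (order="F" mimics R's fill order).
data = np.arange(1, 21).reshape((5, 4), order="F")
beta = np.array([10, 20, 30, 40])

swept = sweep(data, 2, beta, np.multiply)      # like sweep(data, 2, beta, `*`)
centered = sweep(data, 2, data.mean(axis=0))   # subtract each column's mean
```

Any numpy ufunc taking two arrays (np.add, np.divide, and so on) can be passed as func.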
qed
-1

Does this work faster?

t(t(data) * beta)

There are some other great answers with profiling in "Multiply rows of matrix by vector?" (http://stackoverflow.com/questions/3643555/multiply-rows-of-matrix-by-vector).

Finally, to answer your query about numpy, use this reference (search for Matrix Multiplication): http://mathesaurus.sourceforge.net/r-numpy.html

infominer
  • Yes, edited to include a reference to equivalent calls in numpy, and suggested my code as a workaround for speed issues, as sweep is slow for matrix multiplication. The OP can look at my linked answer to see run times. – infominer Apr 16 '14 at 19:25