120

When I look at the source of R packages, I see the function sweep used quite often. Sometimes it's used where a simpler function would have sufficed (e.g., apply); other times, it's impossible to know exactly what it is doing without spending a fair amount of time stepping through the code block it's in.

The fact that I can reproduce sweep's effect using a simpler function suggests that I don't understand sweep's core use cases, and the fact that this function is used so often suggests that it's quite useful.

The context:

sweep is a function in R's standard library; its arguments are:

sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)

# x is the data
# STATS refers to the summary statistics which you wish to 'sweep out'
# FUN is the function used to carry out the sweep, "-" is the default

As you can see, the arguments are similar to those of apply, though sweep requires one more parameter, STATS.

Another key difference is that sweep returns an array of the same shape as the input array, whereas the result returned by apply depends on the function passed in.
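A minimal sketch of that shape difference, using a small matrix chosen only for illustration:

```r
# Illustrative data only: a 4 x 3 integer matrix
M <- matrix(1:12, ncol = 3)

# sweep always returns something with the same dimensions as its input
swept <- sweep(M, 2, colMeans(M))
dim(swept)                      # 4 3

# apply's result depends on what the supplied function returns
col_means  <- apply(M, 2, mean)   # one value per column: a vector
col_ranges <- apply(M, 2, range)  # two values per column: a 2 x 3 matrix
length(col_means)               # 3
dim(col_ranges)                 # 2 3
```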

sweep in action:

# e.g., use 'sweep' to express a given matrix in terms of distance from 
# the respective column mean

# create some data:
M <- matrix(1:12, ncol = 3)

# calculate column-wise mean for M
dx <- colMeans(M)

# now 'sweep' that summary statistic from M
sweep(M, 2, dx, FUN="-")

     [,1] [,2] [,3]
[1,] -1.5 -1.5 -1.5
[2,] -0.5 -0.5 -0.5
[3,]  0.5  0.5  0.5
[4,]  1.5  1.5  1.5

So in sum, what I'm looking for is an exemplary use case or two for sweep.

Please, do not recite or link to the R Documentation, mailing lists, or any of the 'primary' R sources--assume I've read them. What I'm interested in is how experienced R programmers/analysts use sweep in their own code.

Ben Bolker
doug
  • 2
    M-dx does not replicate your result. You answered your own question. – John Aug 10 '10 at 00:59
  • The only usage of `apply` that I can figure out for this result is something like `t(apply(t(M), 2, "-", dx))`, but that's pretty nasty. – Ken Williams May 04 '11 at 14:32

5 Answers

103

sweep() is typically used when you operate on a matrix by row or by column, and the other input of the operation is a different value for each row / column. Whether you operate by row or by column is defined by MARGIN, as for apply(). The values used for what I called "the other input" are defined by STATS. So, for each row (or column), you take a value from STATS and use it in the operation defined by FUN.

For instance, if you want to add 1 to the 1st row, 2 to the 2nd, etc. of the matrix you defined, you will do:

sweep(M, 1, 1:4, "+")

Frankly, I did not understand the definition in the R documentation either; I just learned by looking up examples.
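As a sketch of what that call returns, and of why MARGIN matters, the block below uses the M from the question. The column-wise values c(10, 20, 30) are made up for illustration.

```r
M <- matrix(1:12, ncol = 3)

# Row-wise: add 1 to row 1, 2 to row 2, and so on
by_row <- sweep(M, 1, 1:4, "+")
by_row
#      [,1] [,2] [,3]
# [1,]    2    6   10
# [2,]    4    8   12
# [3,]    6   10   14
# [4,]    8   12   16

# For rows, plain recycling happens to give the same answer,
# because R fills and recycles down the columns
all(by_row == M + 1:4)

# Column-wise: add 10 to column 1, 20 to column 2, 30 to column 3
by_col <- sweep(M, 2, c(10, 20, 30), "+")

# Plain recycling does NOT do this; it still recycles down the columns
wrong <- M + c(10, 20, 30)
all(by_col == wrong)

# The transpose trick does match, but hides the intent
all(by_col == t(t(M) + c(10, 20, 30)))
```

The column-wise case is where sweep earns its keep: it says which dimension STATS is aligned with, instead of relying on the reader to work out the recycling order.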

Rekyt
Daniele Merico
  • 2
  • 4
    to paraphrase a little: `STATS` seems to be a bad label for this variable. It's an input to `FUN` that gets used to modify the value of each element in the matrix (`M`, in this example). `STATS` can be either a constant or a list/vector/etc of a size matching the size of the chosen `MARGIN`. I think. – Roland Dec 28 '17 at 03:54
  • I suspect they named the parameter `STATS` because the function was designed as a tool for scaling, where you subtract `colMeans` and divide by the columns' standard deviations (the `scale()` function is named in the "See also" part of the documentation). But as a matter of fact it doesn't have to be any kind of statistic. That's why the R documentation is misleading, I guess. – Comevussor Feb 07 '22 at 08:07
16

sweep() can be great for systematically manipulating a large matrix either column by column, or row by row, as shown below:

> print(size)
     Weight Waist Height
[1,]    130    26    140
[2,]    110    24    155
[3,]    118    25    142
[4,]    112    25    175
[5,]    128    26    170

> sweep(size, 2, c(10, 20, 30), "+")
     Weight Waist Height
[1,]    140    46    170
[2,]    120    44    185
[3,]    128    45    172
[4,]    122    45    205
[5,]    138    46    200

Granted, this example is simple, but by changing the STATS and FUN arguments, other manipulations are possible.
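Two such variations, as a rough sketch. The size matrix is rebuilt from the printout above; the reference values in ref are hypothetical, invented only to show a custom FUN.

```r
# Rebuild the matrix shown in the printout above
size <- matrix(
  c(130, 110, 118, 112, 128,   # Weight
     26,  24,  25,  25,  26,   # Waist
    140, 155, 142, 175, 170),  # Height
  ncol = 3,
  dimnames = list(NULL, c("Weight", "Waist", "Height"))
)

# Different STATS and FUN: divide each column by its own maximum
scaled <- sweep(size, 2, apply(size, 2, max), "/")

# Custom FUN: percent difference from a per-column reference value.
# 'ref' is a hypothetical set of reference values, not from the original answer.
ref <- c(Weight = 120, Waist = 25, Height = 150)
pct_diff <- sweep(size, 2, ref, function(x, s) 100 * (x - s) / s)
pct_diff[1, ]
```

FUN receives the data and an array of the same shape built from STATS, so any vectorized two-argument function can be used.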

Brad Horn
6

This question is a bit old, but I recently faced the same problem. A typical use of sweep can be found in the source code of the stats function cov.wt, which computes weighted covariance matrices. I'm looking at the code in R 3.0.1. Here sweep is used to subtract the weighted column means before computing the covariance. On line 19 of the code the centering vector is derived:

 center <- if (center) 
        colSums(wt * x)
    else 0

and on line 54 it is swept out of the matrix

x <- sqrt(wt) * sweep(x, 2, center, check.margin = FALSE)

The author of the code is using the default value FUN = "-", which confused me for a while.
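To see those two lines in context, here is a sketch that reproduces the result of cov.wt by hand. It assumes the default method = "unbiased" and weights that already sum to 1; the data are simulated purely for illustration.

```r
set.seed(42)

# Simulated data: 10 observations of 3 variables, with normalized weights
x  <- matrix(rnorm(30), nrow = 10, ncol = 3)
wt <- runif(10)
wt <- wt / sum(wt)

# Weighted column means (wt recycles down the rows of x)
center <- colSums(wt * x)

# Sweep the weighted means out of each column; FUN defaults to "-"
x_centered <- sqrt(wt) * sweep(x, 2, center)

# Unbiased weighted covariance, assuming sum(wt) == 1
manual <- crossprod(x_centered) / (1 - sum(wt^2))

# Compare with the library function
reference <- cov.wt(x, wt = wt)$cov
all.equal(manual, reference)
```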

James King
3

One use is when you're computing weighted sums for an array. Where rowSums or colSums can be assumed to mean 'weights=1', sweep can be used prior to this to give a weighted result. This is particularly useful for arrays with >=3 dimensions.

This comes up e.g. when calculating a weighted covariance matrix as per @James King's example.

Here's another based on a current project:

set.seed(1)
## 2x2x2 array
a1 <- array(as.integer(rnorm(8, 10, 5)), dim=c(2, 2, 2))
## 'element-wise' sum of matrices
## weights = 1
rowSums(a1, dims=2)
## weights
w1 <- c(3, 4)
## a1[, , 1] * 3;  a1[, , 2] * 4
a1 <- sweep(a1, MARGIN=3, STATS=w1, FUN="*")
rowSums(a1, dims=2)
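As a quick check on the example above, this sketch confirms that sweeping the weights over the third dimension and then summing matches the weighted sum written out by hand. The apply-based version is included only for comparison.

```r
set.seed(1)
a1 <- array(as.integer(rnorm(8, 10, 5)), dim = c(2, 2, 2))
w1 <- c(3, 4)

# Weight each 2 x 2 slice, then sum across the slices
weighted <- rowSums(sweep(a1, MARGIN = 3, STATS = w1, FUN = "*"), dims = 2)

# The same thing written out slice by slice
by_hand <- a1[, , 1] * 3 + a1[, , 2] * 4

# An apply-based equivalent: weighted sum over the third dimension
via_apply <- apply(a1, c(1, 2), function(v) sum(v * w1))

all.equal(weighted, by_hand)
all.equal(weighted, via_apply)
```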
dardisco
0

You can use the sweep function to center and scale data, as in the following code. Note that the means and sds here are computed from the sample itself, but they could be arbitrary (for example, reference values that you want to standardize the data against):

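# 25 distinct random scores arranged as a 5 x 5 matrix
df <- matrix(sample.int(150, size = 25, replace = FALSE), 5, 5)

# per-column means and standard deviations
df_means <- colMeans(df)
df_sds <- apply(df, 2, sd)

# center, scale, then shift to mean 50 and sd 10
df_T <- sweep(sweep(df, 2, df_means, "-"), 2, df_sds, "/") * 10 + 50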

This code converts raw scores to T scores (with mean=50 and sd=10):

> df
     [,1] [,2] [,3] [,4] [,5]
[1,]  109    8   89   69   15
[2,]   85   13   25  150   26
[3,]   30   79   48    1  125
[4,]   56   74   23  140  100
[5,]  136  110  112   12   43
> df_T
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 56.15561 39.03218 57.46965 49.22319 40.28305
[2,] 50.42946 40.15594 41.31905 60.87539 42.56695
[3,] 37.30704 54.98946 47.12317 39.44109 63.12203
[4,] 43.51037 53.86571 40.81435 59.43685 57.93136
[5,] 62.59752 61.95672 63.27377 41.02349 46.09661
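When the means and sds come from an external reference rather than the sample, the same pattern applies. The sketch below uses hypothetical reference values (ref_means and ref_sds are invented for illustration) and checks the result against scale(), which accepts explicit center and scale vectors.

```r
set.seed(123)

# Illustrative raw scores: 5 test-takers on 5 items
scores <- matrix(sample.int(150, size = 25), 5, 5)

# Hypothetical reference norms, e.g. from a large standardization sample
ref_means <- c(70, 60, 65, 75, 55)
ref_sds   <- c(40, 35, 30, 45, 38)

# T scores relative to the reference norms
t_scores <- sweep(sweep(scores, 2, ref_means, "-"), 2, ref_sds, "/") * 10 + 50

# Same computation via scale() with explicit center and scale
via_scale <- scale(scores, center = ref_means, scale = ref_sds) * 10 + 50

# scale() attaches extra attributes, so compare values only
all.equal(t_scores, via_scale, check.attributes = FALSE)
```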
Ehsan88
  • 1
    @BenBolker as I mentioned in the answer, because I may want to scale the items according to a reference mean and sd, not the mean and sd of the current sample itself. It occurs when you deal with tests that are administered and standardized in large samples, and you want to standardize your small sample score according to their statistics. – Ehsan88 Sep 23 '14 at 14:57