I have a data frame of 150,000 rows and 2,000 columns of numeric values, some of them negative. I am replacing those negative values with 0, but doing so is extremely slow (~60 min or more).

df[df < 0] = 0

where df[,1441:1453] looks like (all columns/values numeric):

  V1441 V1442 V1443 V1444 V1445 V1446 V1447 V1448 V1449 V1450 V1451 V1452 V1453
1     3     1     0     4     4    -2     0     3    12     5    17    34    27
2     0     1     0     7     0     0     0     1     0     0     0     0     0
3     0     2     0     1     2     3     6     1     2     1    -6     3     1
4     1     2     3     6     1     2     1    -6     3     1    -4     1     0
5     1     2     1    -6     3     1    -4     1     0     0     1     0     0
6     1     0     0     1     0     0     0     0     0     0     1     2     2

Is there a way to speed up this process? The way I am doing it is utterly slow; is there a faster approach? Thanks.

  • 150,000 rows x 2000 columns : this is correct – Benoit B. Oct 11 '12 at 09:06
  • Can you please [show us](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) how your data looks like? Are all columns numeric? – Roland Oct 11 '12 at 09:08

3 Answers

Try converting your data frame to a matrix first; subassignment on a matrix avoids the data-frame overhead.

df <- data.frame(a=rnorm(1000),b=rnorm(1000))
m <- as.matrix(df)
m[m<0] <- 0
df <- as.data.frame(m)
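As a rough sanity check (timings will vary by machine, and the sizes here are scaled down from the question's 150,000 x 2,000 to keep the run short), you can compare the two approaches on a synthetic data frame and confirm they produce the same values:

```r
# synthetic data: 100,000 rows x 20 numeric columns, some negative
df <- as.data.frame(matrix(rnorm(1e5 * 20), nrow = 1e5, ncol = 20))

# direct data.frame subassignment (the slow route from the question)
t_df <- system.time({d <- df; d[d < 0] <- 0})

# matrix round-trip (this answer's route)
t_mat <- system.time({m <- as.matrix(df); m[m < 0] <- 0; d2 <- as.data.frame(m)})

stopifnot(all(d == d2))  # both routes give the same values
print(rbind(data.frame. = t_df, matrix. = t_mat))
```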
– Roland
Both your original approach and the currently accepted answer create an object the same size as m (or df) when evaluating m < 0. (The matrix approach is quicker because there is less internal copying with [<- compared with [<-.data.frame.)

You can use lapply and replace instead; then you are only working on one vector of length nrow(df) at a time and not copying so much:

df <- as.data.frame(lapply(df, function(x) replace(x, x < 0, 0)))

The above code should be quite efficient.

If you use data.table, most of the memory (and time) inefficiency of the data.frame approach is removed. It would be ideal for a large-data situation like yours.

library(data.table)
# replace negatives column by column, producing a list
DT <- lapply(df, function(x) replace(x, x < 0, 0))
# change the list to a data.table by reference, without copying
setattr(DT, 'class', c('data.table', 'data.frame'))
# or
# DT <- as.data.table(lapply(df, function(x) replace(x, x < 0, 0)))

You could also set keys on all the columns and then replace by reference for key values less than 0.
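A minimal sketch of the by-reference idea using data.table::set — here locating the negative rows per column with which() rather than keys, which keeps the loop simple while still modifying the table in place:

```r
library(data.table)

# example data; in the question this would be the 150,000 x 2,000 data frame
df <- data.frame(a = c(-1, 2, -3, 4), b = c(5, -6, 7, -8))
DT <- as.data.table(df)

for (j in seq_along(DT)) {
  idx <- which(DT[[j]] < 0)                    # rows with negatives in column j
  if (length(idx)) set(DT, i = idx, j = j, value = 0)  # overwrite in place, no copy
}
DT
```

Because set() updates columns by reference, no intermediate copy of the whole table is made for any column.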

– mnel
Another data.table answer: it might be faster, and it should definitely consume less memory.

library(data.table)
set.seed(108)
d = data.table(a=rnorm(1000),b=rnorm(1000))
set.colwise = function(x, i, j, value) {
  replace_dot_j = function(e, j) {
    if (is.symbol(e) && identical(e, as.symbol(".j"))) return(j)
    if (is.call(e)) {
      if (e[[1L]] == ".j") e[[1L]] = j
      for (i in seq_along(e)[-1L]) if (!is.null(e[[i]])) e[[i]] = replace_dot_j(e[[i]], j)
    }
    e
  }
  for (jj in j) eval(substitute(
    set(x, .i, .j, value),
    list(
      .i=replace_dot_j(substitute(i), jj),
      .j=jj
    )
  ))
  invisible(x)
}
d
set.colwise(d, i = which(d[[.j]] < 0), j = c("a","b"), value = 0)
d

The .j symbol used in the i argument is iterated over and replaced with each column name from the j argument in turn.

– jangorecki