
In R, I have a reasonably large data frame (d), 10500 rows by 6000 columns. All values are numeric. It has many NA values scattered across its rows and columns, and I am looking to replace these with zeros. I have used:

d[is.na(d)] <- 0

but this is rather slow. Is there a better way to do this in R?

I am open to using other R packages.

I would prefer it if the discussion focused on computational speed rather than, say, "why would you replace NAs with zeros?". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), its focus was not computational speed on a large data frame with many missing values.

Thanks!

Edited Solution: As helpfully suggested, converting d to a matrix before applying is.na sped up the computation by an order of magnitude.
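A minimal sketch of that approach, assuming every column of d is numeric (otherwise as.matrix would coerce the whole matrix to character):

m <- as.matrix(d)      # all-numeric data frame -> numeric matrix
m[is.na(m)] <- 0       # single vectorised replacement on the matrix
d <- as.data.frame(m)  # convert back if a data frame is needed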

Peter
  • Does this data frame have columns of the same type (i.e., all numeric, or all character)? Storing it as a matrix might speed this up. – Spacedman Oct 17 '16 at 21:36
  • often converting to `data.table` provides a speed improvement on many operations, but `is.na.data.table` is not one of them. – shayaa Oct 17 '16 at 22:00
  • @Spacedman, all numeric - sorry, should have specified that; will edit. – Peter Oct 17 '16 at 22:20

2 Answers


You can get a considerable performance increase by using the data.table package. It is, in general, much faster for large-scale manipulations and transformations. The downside is the learning curve of its syntax, but if you are looking for a speed boost, the investment could be worth it.

Generate fake data

r <- 10500  
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))

Base R

df1 <- df
system.time(df1[is.na(df1)] <- 0)

   user  system elapsed 
   4.74    0.00    4.78 

tidyr - replace_na()

dfReplaceNA <- function (df) {
  require(tidyr)
  l <- setNames(as.list(rep(0, ncol(df))), names(df))  # named list: replace NA with 0 in every column
  replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))

   user  system elapsed 
   4.27    0.00    4.28 

data.table - set()

dtReplaceNA <- function (df) {
  require(data.table)
  dt <- data.table(df)
  for (j in seq_len(ncol(dt))) {set(dt, which(is.na(dt[[j]])), j, 0)}  # replace NAs in column j with 0, by reference
  setDF(dt)  # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))

   user  system elapsed 
   0.80    0.31    1.11 
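If copying the data is a concern, here is a sketch of a variant, assuming in-place modification of the input is acceptable: setDT() converts the existing data frame to a data.table by reference instead of copying it the way data.table(df) does. (Newer data.table releases also offer setnafill() for constant fills on numeric columns.)

dtReplaceNAInPlace <- function (df) {
  require(data.table)
  setDT(df)                                # convert to data.table by reference, no copy
  for (j in seq_len(ncol(df))) {
    set(df, which(is.na(df[[j]])), j, 0)   # overwrite NAs in column j with 0
  }
  setDF(df)                                # return a plain data.frame
}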

Compare data frames

all.equal(df1, df2)

[1] TRUE

all.equal(df1, df3)

[1] TRUE
BChan

I guess that all columns must be numeric; otherwise, assigning 0s to NAs wouldn't be sensible.

I get the following timings, with approximately 10,000 NAs:

> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
   user  system elapsed 
   0.19    0.12    0.31 
> system.time(D[is.na(D)] <- 0)
   user  system elapsed 
   3.87    0.06    3.95 

So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just about 4 seconds on my modest laptop, which is much less time than it took to answer the question. If the problem really is of this magnitude, is that actually too slow?

I hope this helps.

John Fox
  • I am looping over several hundred data frames of this size, some of which are also much larger, so the speed boost is of practical relevance. Thanks for the answer. – Peter Oct 18 '16 at 00:10
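A sketch of how the matrix conversion could be wrapped up and applied across such a loop; the list name list_of_dfs is hypothetical, and every element is assumed to be an all-numeric data frame.

zero_na <- function(df) {
  m <- as.matrix(df)   # all-numeric data frame -> numeric matrix
  m[is.na(m)] <- 0     # fast vectorised replacement
  as.data.frame(m)     # back to a data frame
}

cleaned <- lapply(list_of_dfs, zero_na)   # apply to each data frame in the (hypothetical) list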