
In R, I have a reasonably large data frame (d), 10500 rows by 6000 columns. All values are numeric. It has many NA values scattered across its rows and columns, and I am looking to replace these with zeros. I have used:

d[is.na(d)] <- 0

but this is rather slow. Is there a better way to do this in R?

I am open to using other R packages.

I would prefer it if the discussion focused on computational speed rather than, say, "why would you replace NAs with zeros?". And, while I realize a similar question has been asked (How do I replace NA values with zeros in an R dataframe?), its focus was not computational speed on a large data frame with many missing values.

Thanks!

Edited Solution: As helpfully suggested, converting d to a matrix before applying is.na sped up the computation by an order of magnitude.
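A minimal sketch of that approach, assuming every column of d is numeric (otherwise as.matrix would coerce the whole matrix to character):

m <- as.matrix(d)      # all-numeric data frame -> numeric matrix
m[is.na(m)] <- 0       # single vectorised replacement on the matrix
d <- as.data.frame(m)  # convert back if a data frame is needed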

Peter
  • Does this data frame have columns of the same type (i.e., all numeric, or all character)? Storing it as a matrix might speed this up. – Spacedman Oct 17 '16 at 21:36
  • often converting to `data.table` provides a speed improvement on many operations, but `is.na.data.table` is not one of them. – shayaa Oct 17 '16 at 22:00
  • @Spacedman, all numeric - sorry, should have specified that; will edit. – Peter Oct 17 '16 at 22:20

2 Answers


You can get a considerable performance increase by using the data.table package. It is, in general, much faster for large-scale manipulations and transformations. The downside is the learning curve of its syntax, but if you are looking for a speed boost, the investment could be worth it.

Generate fake data

r <- 10500  
c <- 6000
x <- sample(c(NA, 1:5), r * c, replace = TRUE)
df <- data.frame(matrix(x, nrow = r, ncol = c))

Base R

df1 <- df
system.time(df1[is.na(df1)] <- 0)

   user  system elapsed 
   4.74    0.00    4.78 

tidyr - replace_na()

dfReplaceNA <- function (df) {
  require(tidyr)
  l <- setNames(as.list(rep(0, ncol(df))), names(df))  # named list: replace NA with 0 in every column
  replace_na(df, l)
}
system.time(df2 <- dfReplaceNA(df))

   user  system elapsed 
   4.27    0.00    4.28 

data.table - set()

dtReplaceNA <- function (df) {
  require(data.table)
  dt <- data.table(df)
  for (j in seq_len(ncol(dt))) {set(dt, which(is.na(dt[[j]])), j, 0)}  # replace NAs in column j with 0, by reference
  setDF(dt)  # Return back a data.frame object
}
system.time(df3 <- dtReplaceNA(df))

   user  system elapsed 
   0.80    0.31    1.11 
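If copying the data is a concern, here is a sketch of a variant, assuming in-place modification of the input is acceptable: setDT() converts the existing data frame to a data.table by reference instead of copying it the way data.table(df) does. (Newer data.table releases also offer setnafill() for constant fills on numeric columns.)

dtReplaceNAInPlace <- function (df) {
  require(data.table)
  setDT(df)                                # convert to data.table by reference, no copy
  for (j in seq_len(ncol(df))) {
    set(df, which(is.na(df[[j]])), j, 0)   # overwrite NAs in column j with 0
  }
  setDF(df)                                # return a plain data.frame
}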

Compare data frames

all.equal(df1, df2)

[1] TRUE

all.equal(df1, df3)

[1] TRUE
BChan

I guess that all columns must be numeric; otherwise, assigning 0s to NAs wouldn't be sensible.

I get the following timings, with approximately 10,000 NAs:

> M <- matrix(0, 10500, 6000)
> set.seed(54321)
> r <- sample(1:10500, 10000, replace=TRUE)
> c <- sample(1:6000, 10000, replace=TRUE)
> M[cbind(r, c)] <- NA
> D <- data.frame(M)
> sum(is.na(M)) # check
[1] 9999
> sum(is.na(D)) # check
[1] 9999
> system.time(M[is.na(M)] <- 0)
   user  system elapsed 
   0.19    0.12    0.31 
> system.time(D[is.na(D)] <- 0)
   user  system elapsed 
   3.87    0.06    3.95 

So, with this number of NAs, I get about an order of magnitude speedup by using a matrix. (With fewer NAs, the difference is smaller.) But the time using a data frame is just about 4 seconds on my modest laptop, which is much less time than it took to answer the question. If the problem really is of this magnitude, is that actually too slow?

I hope this helps.

John Fox
  • I am looping over several hundred data frames of this size, some of which are also much larger, so the speed boost is of practical relevance. Thanks for the answer. – Peter Oct 18 '16 at 00:10
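A sketch of how the matrix conversion could be wrapped up and applied across such a loop; the list name list_of_dfs is hypothetical, and every element is assumed to be an all-numeric data frame.

zero_na <- function(df) {
  m <- as.matrix(df)   # all-numeric data frame -> numeric matrix
  m[is.na(m)] <- 0     # fast vectorised replacement
  as.data.frame(m)     # back to a data frame
}

cleaned <- lapply(list_of_dfs, zero_na)   # apply to each data frame in the (hypothetical) list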