7

I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in this Post, but it's extremely slow for my data table (takes already over 15mins) Example from the other post:

 data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
                   dogs=rep(c("woof", " ", NA),1e6))
 system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))

Is there a more data.table fast way to achieve this?

Indeed the provided data does not look much like the original data, it was just to give an example. The following subset of my real data gives the CharToDate(x) error:

DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)
Community
  • 1
  • 1
Tim_Utrecht
  • 1,459
  • 6
  • 24
  • 44

4 Answers4

17

Here's probably the generic data.table way of doing this. I'm also going to use your regex which handles several types of blanks (I havn't seen other answers doing this). You probably shouldn't run this over all your columns rather only over the factor or character ones, because other classes won't accept blank values.

For factors

indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_) 

For characters

indx2 <- which(sapply(data, is.character)) 
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 2
    Thanks a lot (again) @David. Worked in 4.79 seconds! – Tim_Utrecht Jul 20 '15 at 13:13
  • now I get the NA values as which are not recognized by the `is.na()` function. If I change it to `system.time(for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA))` it fails, what you probably also encountered? Is there a way of setting the values to the 'standard' NA? – Tim_Utrecht Jul 20 '15 at 13:42
  • Isn't it strange that the NA_character is recognized in the `is.na(NA_character_)` but not if I use this function for the data.table where I just replaced blanks with that NA_character_: `data[is.na(DEFAULT_DATE)]->dataNA`. Or am I missing something? PS. see edit in Question for the new data. – Tim_Utrecht Jul 20 '15 at 14:05
  • 1
    Ok, try this `indx <- which(sapply(data, is.factor)); system.time(for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_)); indx2 <- which(sapply(data, is.character)); system.time(for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_))` – David Arenburg Jul 20 '15 at 14:11
  • @DavidArenburg I don't know why but I'm testing `data[data=='']<-NA` versus `for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)` with `microbenchmark` package and tells me first option is much faster. – Oriol Prat May 29 '18 at 11:37
  • @OriolPrat you are comparing apples with oranges. Running regex vs doing an exact match has nothing to do with data.table. The point of this answer was to show how to use OPs regex using correct data.table syntax. In this case regex wasn't probably needed in the first place. Although `data[data=='']<-NA` doesn't replace the `" "` values. – David Arenburg May 29 '18 at 13:00
8

Use this approach:

system.time(data[data==''|data==' ']<-NA)
  user  system elapsed 
  1.47    0.19    1.66 

system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
  user  system elapsed 
  3.41    0.20    3.64
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87
3

Assuming you had mistake while populating your data, below is the solution using data.table which you used in tag.

library(data.table)
data = data.table(cats=rep(c('', ' ', 'meow'),1000000),dogs=rep(c("woof", " ", NA),1000000))
system.time(data[cats=='', cats := NA][dogs=='', dogs := NA])
#  user  system elapsed 
# 0.056   0.000   0.059 

If you have a lot of column see David's comment.

jangorecki
  • 16,384
  • 4
  • 79
  • 160
1

After trying a few different ways to do this, I found the quickest and simplest option to be:

data[data==""] <- NA
M. Warren
  • 111
  • 1
  • 5