2

I am trying to write a function that finds missing values in a data table and imputes them. The function relies heavily on `is.na()`, both to locate the missing values and to replace them with the imputed value. It works for every column type except character columns where missing values are stored as blank cells: `is.na()` does not treat an empty string as missing, so those cells are skipped during imputation.

Example:

    library(data.table)
    t <- data.table(x = c('an','ax','','az'), y = c('bn','','bz','bx'))
    t
    #     x  y
    # 1: an bn
    # 2: ax
    # 3:    bz
    # 4: az bx
    is.na(t[, x])
    # [1] FALSE FALSE FALSE FALSE

where it should be

      [1] FALSE FALSE TRUE FALSE

Any help is highly appreciated.

Thanks.

Anuj
  • 3
    Please show a small reproducible example and the expected result. For guidelines, check [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). If you have `''` as missing, then `yourdf$yourCol==''` should give a logical TRUE/FALSE for `''` – akrun Jun 19 '15 at 11:48

2 Answers

4

You can use the fast `nzchar` like this:

    is.na(x) | !nzchar(x)

For example:

    x <- c(NA,'','a')
    is.na(x) | !nzchar(x)
    ## [1]  TRUE  TRUE FALSE
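
Since the function in the question is built around `is.na()`, another option is to turn blank strings into `NA` first, so the existing `is.na()` logic can stay as it is. The following is only a sketch; `blank_to_na` is a hypothetical helper name, not something from base R or data.table:

```r
# Hypothetical helper: convert empty strings to NA so that is.na() sees them.
# Non-character input is returned unchanged.
blank_to_na <- function(x) {
  if (is.character(x)) {
    x[!is.na(x) & !nzchar(x)] <- NA_character_
  }
  x
}

x <- c(NA, '', 'a')
is.na(blank_to_na(x))
## [1]  TRUE  TRUE FALSE
```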

Applied to the OP's example, wrapped in a function with `ifelse`:

    tt <- data.table(x=c('an','ax','','az'),y=c('bn','','bz','bx'))
    tt[, lapply(.SD,
                function(x)
                  ifelse(is.na(x) | !nzchar(x),'some value',x)) ]

               x          y
    1:         an         bn
    2:         ax some value
    3: some value         bz
    4:         az         bx
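
If the columns need to be addressed by index rather than by name, a possible variant is to loop over column positions and update each column by reference with data.table's `set()`. This is a sketch under the assumption that only character columns should be touched; `is_blank` is a hypothetical helper name:

```r
library(data.table)

# Hypothetical helper: TRUE where a value is NA or an empty string.
is_blank <- function(x) is.na(x) | !nzchar(x)

tt <- data.table(x = c('an', 'ax', '', 'az'), y = c('bn', '', 'bz', 'bx'))

# Visit each column by position and overwrite the blank rows in place.
for (j in seq_along(tt)) {
  if (is.character(tt[[j]])) {
    set(tt, i = which(is_blank(tt[[j]])), j = j, value = 'some value')
  }
}
```

Note that `nzchar()` has to be applied to each column vector (`tt[[j]]`), not to the data table as a whole.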
agstudy
  • I don't think this works for data tables. I tried your code on my data table but it gives the same result: `nzchar(test_dt)` returns `[1] TRUE`. It works perfectly for vector input, though. – Anuj Jun 19 '15 at 13:52
  • @Anuj that's why you should give a reproducible example and the desired output. Even though you added one later, your example is still not reproducible. Can you use `dput` to add your data? Please take the time to read the link in the comment below your question. – agstudy Jun 19 '15 at 14:03
  • I have added a reproducible example. Sorry for the confusion, I am new to Stack Overflow. Also, your solution works fine if I use it like this: `is.na(t[,x]) | !nzchar(t[,x])`, but when I call the column using its index, it gives the same result. – Anuj Jun 19 '15 at 14:25
  • @Anuj no problem. You are welcome. I added a solution using your data. – agstudy Jun 19 '15 at 15:21
  • 2
    Instead of `ifelse` (which has something of a rep for being slow), one could use `replace(x, which(is.na(x) | !nzchar(x)), 'some value')` – Frank Jun 19 '15 at 15:52
  • @Frank Interesting. Can you add some benchmarking please? – agstudy Jun 19 '15 at 16:48
  • 1
    @agstudy The claim about it being slow is based on Ricardo Saporta's post here: http://stackoverflow.com/q/16275149/1191259 I don't think it actually applies to your answer, but more to cases like `ifelse(cond,costly_calculation,costly_calculation2)` since both costly calculations have to be made in their entirety even though only a part of each is used. – Frank Jun 19 '15 at 17:25
  • 2
    Oh actually, here's a benchmark for a case similar to yours where it's 10x as slow: `x <- sample(1:2,1e7,replace=TRUE); system.time(replace(x,which(x==2L),0L)); system.time(ifelse(x==2L,0L,x))` – Frank Jun 19 '15 at 17:28
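
The `replace()` alternative suggested in the comments above can be sketched on a small character vector as follows. The variable names are made up for illustration, and no timings are claimed here; the snippet only checks that both approaches produce the same result:

```r
x <- c('an', NA, '', 'az')

# TRUE where the value is NA or an empty string.
blank <- is.na(x) | !nzchar(x)

# Same replacement done two ways.
via_replace <- replace(x, which(blank), 'some value')
via_ifelse  <- ifelse(blank, 'some value', x)

# Both vectors hold the same values.
identical(via_replace, via_ifelse)
```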
0

Another solution using conditional assignment (using i):

    DT <- data.table(x = c('an','ax','','az',NA),
                     y = c(NA,'bn','','bz','bx'))
    DT[x %in% c(NA, ""), x := 'some value']
    DT[y %in% c(NA, ""), y := 'some value']

Result:

                x          y
    1:         an some value
    2:         ax         bn
    3: some value some value
    4:         az         bz
    5: some value         bx
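
With more than a couple of columns, the same conditional assignment could be applied in a loop instead of being written out once per column. This is a sketch that assumes every column is character and should receive the same fill value; `col` is just a loop variable:

```r
library(data.table)

DT <- data.table(x = c('an', 'ax', '', 'az', NA),
                 y = c(NA, 'bn', '', 'bz', 'bx'))

# Same test as above (value is NA or ""), applied to each column in turn.
# get(col) looks the column up by name; (col) := assigns to that column.
for (col in names(DT)) {
  DT[get(col) %in% c(NA, ""), (col) := 'some value']
}
```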
Artem Klevtsov