I have a dataset with 50 columns, and I would like to write a function that assigns a zero, 'none', or 99 (as I specify) to each of the 50 columns wherever NAs are present. I could write a line of code for each column (as in my example below), but I thought there must be a way to do this with a function that reduces the amount of code I need to write.

Here is an example with four columns.

set.seed(1)
dat <- data.frame(one = rnorm(15),
                  two = sample(LETTERS, 15),
                  three = rnorm(15),
                  four = runif(15))
dat <- data.frame(lapply(dat, function(x) { x[sample(15, 5)] <- NA; x }))
head(dat)
str(dat)
dat$two <- as.character(dat$two)

dat[["one"]][is.na(dat[["one"]])] <- 0
dat[["two"]][is.na(dat[["two"]])] <- 'none'
dat[["three"]][is.na(dat[["three"]])] <- 99
dat[["four"]][is.na(dat[["four"]])] <- 0
head(dat)

I thought a starting point would be to modify this function:

convert.nas <- function(obj,types){
  for (i in 1:length(obj)){
    FUN <- switch(types[i],character = as.character, 
                  numeric = as.numeric, 
                  factor = as.factor,
                  date = as.Date)
    obj[,i] <- FUN(obj[,i])
  }
  obj
}

EDIT: Per suggestions/comments by others, I'll provide some additional context and clarification. I need to remove the NAs because of additional data manipulations (subscripting in particular) that occur later in my script. However, I do appreciate the point made by @Ananda about this making my data less usable. Regarding @Henrik's comment about the criteria for choosing between 99 and 0: there is no actual criterion in a logical sense; the choice is specific to three columns that I need to define manually.

-al

cherrytree
  • Why do we have to get moles involved? :-) – A5C1D2H2I1M1N2O1R2T1 Jul 08 '14 at 15:29
  • Why do you want to do this? It will ultimately make your dataset *less* usable. If you are looking for more sophisticated `NA` handling, perhaps you should look at the "memisc" package. I've demonstrated its `NA` options [at this answer](http://stackoverflow.com/a/16130402/1270695). – A5C1D2H2I1M1N2O1R2T1 Jul 08 '14 at 15:33
  • In any case you need to clearly describe in words the criteria for replacement of `NA`. E.g. it is not clear to me why `NA`s in "one" (a numeric) are replaced with 0, whereas those in "three" (also numeric) are replaced by 99. – Henrik Jul 08 '14 at 15:39

3 Answers

You could change many columns at the same time:

columns_to_change <- c("one","four")
dat[columns_to_change] <- lapply(dat[columns_to_change], function(x) replace(x, is.na(x), 0))
columns_to_change <- c("two")
dat[columns_to_change] <- lapply(dat[columns_to_change], function(x) replace(x, is.na(x), "none"))
columns_to_change <- c("three")
dat[columns_to_change] <- lapply(dat[columns_to_change], function(x) replace(x, is.na(x), 99))

or without code repetition:

L <- list(
   list(cols = c("one","four"), replacement = 0),
   list(cols = c("two"), replacement = "none"),
   list(cols = c("three"), replacement = 99)
)
for (pars in L) {
    dat[pars$cols] <- lapply(
        dat[pars$cols]
        , function(x) replace(x, is.na(x), pars$replacement)
    )
}
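
The loop above could also be wrapped in a small helper. This is only a sketch: the function name `fill_na`, its `spec` argument, and the `toy` data frame are made up here for illustration and are not part of the original answer. It assumes the character columns really are character (not factor).

```r
# Hypothetical helper: `fill_na` and `spec` are illustrative names.
# `spec` is a named list mapping column name -> replacement value.
fill_na <- function(df, spec) {
  for (col in names(spec)) {
    x <- df[[col]]
    x[is.na(x)] <- spec[[col]]   # replace only the missing entries
    df[[col]] <- x
  }
  df
}

# Small self-contained example data (not the data from the question)
toy <- data.frame(
  one   = c(1.5, NA, 3),
  two   = c("a", NA, "c"),
  three = c(NA, 2, NA),
  stringsAsFactors = FALSE
)

filled <- fill_na(toy, list(one = 0, two = "none", three = 99))
str(filled)
```

Columns not named in `spec` are left untouched, so only a few columns can be changed if needed. If a column were a factor, assigning a value that is not one of its levels would produce `NA` with a warning, which is why the question converts `two` with `as.character()` first.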
Marek
  • The solution (w/o code repetition) provided by @Marek is the one I chose for this particular problem. I liked the short function and the ability to refer to the columns by their names. I can also just change a few of the columns if needed. – cherrytree Jul 08 '14 at 16:18

You could try (assuming that the second column is character):

 dat[is.na(dat)] <- c(0,'none',99,0)[col(dat)][is.na(dat)]

@Marek is right that it converts the columns to character class. It could be fixed by

 dat[] <-  lapply(dat, function(x) if(!any(grepl("[[:alpha:]]+",x))) as.numeric(x) else x)

but it is ugly.

Update

You could instead do:

 dat[is.na(dat)] <- list(0,'none',99,0)[col(dat)][is.na(dat)]
 dat[] <- lapply(dat, unlist)
 str(dat)
 # 'data.frame':    15 obs. of  4 variables:
 # $ one  : num  0 0.184 -0.836 0 0.33 ...
 # $ two  : chr  "M" "O" "L" "E" ...
 # $ three: num  0.8042 -0.0571 0.5036 99 99 ...
 # $ four : num  0.892 0 0.39 0 0.961 ...
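
The difference between the `c()` version and the `list()` version seems to come down to how the two constructors treat mixed types. A minimal sketch (the names `fills_vec` and `fills_list` are illustrative only):

```r
# c() builds an atomic vector, so mixed inputs are coerced to a single type
fills_vec <- c(0, "none", 99, 0)
class(fills_vec)            # "character"

# list() keeps each element's own type
fills_list <- list(0, "none", 99, 0)
sapply(fills_list, class)   # "numeric" "character" "numeric" "numeric"
```

Because the atomic vector holds only strings, assigning from it turns the numeric columns into character, whereas the list version lets the numeric replacements stay numeric.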

akrun

Maybe you're looking for a function like the following:

naSwitcher <- function(indf, cols, naType) {
  if (length(cols) != length(naType)) stop("'cols' and 'naType' must have the same length")
  indf[cols] <- lapply(seq_along(indf[cols]), function(x) {
    switch(naType[x],
           "0" = { indf[cols[x]][is.na(indf[cols[x]])] <- 0; indf[cols[x]] },
           "none" = { indf[cols[x]][is.na(indf[cols[x]])] <- "none"; indf[cols[x]] },
           "99" = { indf[cols[x]][is.na(indf[cols[x]])] <- 99; indf[cols[x]] },
           "NA" = { indf[cols[x]] },
           stop("naType must be either '0', 'none', '99', or 'NA'"))    
  })
  indf
}

Here's how you could use it:

head(naSwitcher(dat, 1:4, c("0", "none", "99", "99")))
#          one  two       three       four
# 1  0.0000000    M  0.80418951  0.8921983
# 2  0.1836433    O -0.05710677 99.0000000
# 3 -0.8356286    L  0.50360797  0.3899895
# 4  0.0000000    E 99.00000000 99.0000000
# 5  0.3295078    S 99.00000000  0.9606180
# 6 -0.8204684 none -1.28459935  0.4346595

(But I recommend sticking to NA values...)
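
As a rough illustration of that recommendation, here is a sketch (with a made-up vector `x`, not data from the question) of how subscripting can be done while keeping the `NA` values, and of how a sentinel such as 99 can distort a summary:

```r
x <- c(5, NA, 12, 8, NA)

# Logical subscripting propagates NA into the result
with_na <- x[x > 6]             # NA 12 8 NA

# which() returns only the positions that are TRUE, dropping the NAs
without_na <- x[which(x > 6)]   # 12 8

# A sentinel value is counted as real data; na.rm = TRUE skips the NAs instead
mean_sentinel <- mean(replace(x, is.na(x), 99))   # 44.6
mean_na_rm    <- mean(x, na.rm = TRUE)            # 8.333333
```

If the later subscripting is the only reason for removing the `NA` values, wrapping the condition in `which()` may be enough.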

A5C1D2H2I1M1N2O1R2T1