23

I have a question similar to this one, but my dataset is a bit bigger: 50 columns, with 1 column as the UID and the other columns carrying either TRUE or NA. I want to change all the NAs to FALSE, but I don't want to use an explicit loop.

Can plyr do the trick? Thanks.

UPDATE #1

Thanks for the quick reply, but what if my dataset is like the one below:

df <- data.frame(
  id = c(rep(1:19),NA),
  x1 = sample(c(NA,TRUE), 20, replace = TRUE),
  x2 = sample(c(NA,TRUE), 20, replace = TRUE)
)

I only want x1 and x2 to be processed. How can this be done?

lokheart

6 Answers

34

If you want to do the replacement for only a subset of variables, you can still use the x[is.na(x)] <- value trick, as follows:

df[c("x1", "x2")][is.na(df[c("x1", "x2")])] <- FALSE

IMO using temporary variables makes the logic easier to follow:

vars.to.replace <- c("x1", "x2")
df2 <- df[vars.to.replace]
df2[is.na(df2)] <- FALSE
df[vars.to.replace] <- df2
Hong Ooi
  • I know this is an old post, but would you explain the first line to me? I get the logic when you break it down using temp variables, but I'd like to understand the one line form. I thought I was familiar with subsetting but I don't understand the [][]. I searched "double brackets" but that turned up something different. – tmakino May 02 '13 at 13:57
  • @tmakino You just have to read the double brackets as different subsets from left to right. For example, if `x <- 1:10`, then `x[5:10][1:4]` will give you the vector `5 6 7 8`. In multiple steps, you could take the first subset and call it y, `y <- x[5:10]` which is `5 6 7 8 9 10`. And then subset that vector `y[1:4]`, which gives you `5 6 7 8` again. – blakeoft Oct 07 '14 at 14:39
  • You can also use the column position instead of explicitly naming them, which is useful when you have a lot of variables to convert or if they have long names: `df2[,14:16][is.na(df2[,14:16])] <- 0`, for instance, replaces `NA` with `0` in columns 14, 15, and 16 of data frame, df2. – coip May 07 '15 at 14:54
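
As a quick check of the chained-bracket reading discussed in the comments above, here is a minimal sketch on toy data (the small data frame and the names `cols`, `one_line`, `stepwise` and `tmp` are made up for illustration). It compares the one-line form with the temporary-variable form:

```r
# Chained subsetting reads left to right:
# take elements 5..10, then the first 4 of those.
x <- 1:10
y <- x[5:10][1:4]

# Toy data frame (not the asker's data).
df <- data.frame(
  id = c(1:4, NA),
  x1 = c(TRUE, NA, TRUE, NA, NA),
  x2 = c(NA, NA, TRUE, TRUE, NA)
)
cols <- c("x1", "x2")

# One-line form: subset the columns, then assign into the NA positions.
one_line <- df
one_line[cols][is.na(one_line[cols])] <- FALSE

# Same thing spelled out with a temporary variable.
stepwise <- df
tmp <- stepwise[cols]
tmp[is.na(tmp)] <- FALSE
stepwise[cols] <- tmp

# Compare the two results.
identical(one_line, stepwise)
```

In both forms only the selected columns are touched, so the NA in id is left alone.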
17

tidyr::replace_na is an excellent function for this.

library(tidyr)  # also provides the %>% pipe

df %>%
  replace_na(list(x1 = FALSE, x2 = FALSE))

This is such a great quick fix. The only trick is that you have to make a list of the columns you want to change.
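
With around 50 columns, typing that list by hand gets tedious. A minimal sketch of building it programmatically, assuming the only column to skip is named "id" (the toy data and the names `cols`, `fill` and `filled` are illustrative, not from the original answer):

```r
library(tidyr)

# Toy data frame (not the asker's data).
df <- data.frame(
  id = c(1:4, NA),
  x1 = c(TRUE, NA, TRUE, NA, NA),
  x2 = c(NA, NA, TRUE, TRUE, NA)
)

# Every column except the ID column (assumed here to be named "id").
cols <- setdiff(names(df), "id")

# Named list of replacements, equivalent to list(x1 = FALSE, x2 = FALSE).
fill <- setNames(rep(list(FALSE), length(cols)), cols)

# Columns not named in the list, such as id, keep their NAs.
filled <- replace_na(df, fill)
```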

mtelesha
9

Try this code:

df <- data.frame(
  id = c(rep(1:19), NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
replace(df, is.na(df), FALSE)

UPDATE: another solution, which leaves the id column untouched.

df2 <- df <- data.frame(
  id = c(rep(1:19), NA),
  x1 = sample(c(NA, TRUE), 20, replace = TRUE),
  x2 = sample(c(NA, TRUE), 20, replace = TRUE)
)
df2[names(df) == "id"] <- FALSE
df2[names(df) != "id"] <- TRUE
replace(df, is.na(df) & df2, FALSE)
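
The idea above is to combine is.na(df) with a second logical mask that is FALSE for id. A possibly simpler sketch of the same masking idea, on toy data, edits the is.na() matrix directly instead of building a second data frame (the names `mask` and `result` are made up for illustration):

```r
# Toy data frame (not the asker's data).
df <- data.frame(
  id = c(1:4, NA),
  x1 = c(TRUE, NA, TRUE, NA, NA),
  x2 = c(NA, NA, TRUE, TRUE, NA)
)

# Logical matrix: TRUE wherever df holds an NA.
mask <- is.na(df)

# Switch the mask off for the column that should keep its NAs.
mask[, "id"] <- FALSE

# Replace only the positions still marked TRUE.
result <- replace(df, mask, FALSE)
```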
Triad sou.
4

With dplyr you could also do

library(dplyr)
df %>% mutate_each(funs(replace(., is.na(.), FALSE)), x1, x2)

It is a bit less readable than just using replace(), but more generic, as it lets you select the columns to be transformed. This solution applies especially if you want to keep NAs in some columns but get rid of them in others.
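
Note that mutate_each() and funs() have since been deprecated in dplyr. A sketch of the same column-wise replacement using across(), assuming dplyr 1.0.0 or later and toy data rather than the asker's:

```r
library(dplyr)

# Toy data frame (not the asker's data).
df <- data.frame(
  id = c(1:4, NA),
  x1 = c(TRUE, NA, TRUE, NA, NA),
  x2 = c(NA, NA, TRUE, TRUE, NA)
)

# Apply the replacement to x1 and x2 only; id keeps its NA.
result <- df %>%
  mutate(across(c(x1, x2), ~ replace(.x, is.na(.x), FALSE)))
```

The column selection inside across() accepts tidyselect helpers, so something like `-id` should also work for selecting everything except the ID column.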

Holger Brandl
4

You can use the NAToUnknown function in the gdata package:

df[,c('x1', 'x2')] = gdata::NAToUnknown(df[,c('x1', 'x2')], unknown = 'FALSE')
Ramnath
  • Excellent function except for one snag - if I want to change unknowns to 0, and I already have some NAs and zeroes in the vector, then I receive the error message `Error in NAToUnknown.default(x = dots[[1L]][[1L]], unknown = dots[[2L]][[1L]], : 'x' already has value “0”`. – Jubbles Mar 01 '12 at 19:22
0

An option would be to use a for loop.

for(i in c("x1", "x2")) df[[i]][is.na(df[[i]])] <- FALSE

Benchmark

set.seed(42)
df <- data.frame(
  id = c(rep(1:19),NA),
  x1 = sample(c(NA,TRUE), 20, replace = TRUE),
  x2 = sample(c(NA,TRUE), 20, replace = TRUE)
)

bench::mark(check=FALSE,
"Holger Brandl" = local(dplyr::mutate_each(df, dplyr::funs(replace(., is.na(.), F)), x1, x2)),
"mtelesha" = local(df <- tidyr::replace_na(df, list(x1 = FALSE, x2 = FALSE))),
Ramnath = local(df[,c('x1', 'x2')] <- gdata::NAToUnknown(df[,c('x1', 'x2')], unknown = 'FALSE')),
"Hong Ooi" = local(df[c("x1", "x2")][is.na(df[c("x1", "x2")])] <- FALSE),
GKi = local(for(i in c("x1", "x2")) df[[i]][is.na(df[[i]])] <- FALSE) )
#  expression         min   median `itr/sec` mem_al…¹ gc/se…² n_itr  n_gc total…³
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t>
#1 Holger Brandl  16.93ms  17.33ms      57.6  34.43KB    19.2    21     7   365ms
#2 mtelesha        3.94ms   4.39ms     226.    8.15KB    13.1   103     6   456ms
#3 Ramnath       400.28µs 415.44µs    2381.    1.55KB    16.7  1142     8   480ms
#4 Hong Ooi      196.87µs 206.72µs    4755.      488B    18.8  2276     9   479ms
#5 GKi             61.8µs  66.16µs   14808.      280B    20.9  7076    10   478ms

The for loop is about 3 times faster than Hong Ooi's solution, which comes second, and it uses the least memory.

GKi