
I am trying to achieve something similar to this question, but with multiple values that must be replaced by NA, and in a large data set.

df <- data.frame(name = rep(letters[1:3], each = 3), foo = rep(1:9), var1 = rep(1:9), var2 = rep(3:5, each = 3))

which generates this data frame:

df
  name foo var1 var2
1    a   1    1    3
2    a   2    2    3
3    a   3    3    3
4    b   4    4    4
5    b   5    5    4
6    b   6    6    4
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

I would like to replace all occurrences of, say, 3 and 4 with NA, but only in the columns whose names start with "var".

I know that I can use a combination of [] operators to achieve the result I want:

df[,grep("^var[:alnum:]?",colnames(df))][ 
        df[,grep("^var[:alnum:]?",colnames(df))] == 3 |
        df[,grep("^var[:alnum:]?",colnames(df))] == 4
   ] <- NA

df
  name foo var1 var2
1    a   1    1    NA
2    a   2    2    NA
3    a   3    NA   NA
4    b   4    NA   NA
5    b   5    5    NA
6    b   6    6    NA
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5
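
Storing the column index in a variable (idx, below) makes this a little tidier, though it is still the same technique:

idx <- grep("^var[[:alnum:]]?", colnames(df))
df[idx][df[idx] == 3 | df[idx] == 4] <- NA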

Now my questions are the following:

  1. Is there a way to do this efficiently, given that my actual data set has about 100,000 rows and 400 out of 500 variables start with "var"? It seems (subjectively) slow on my computer when I use the double-bracket technique.
  2. How would I approach the problem if, instead of 2 values (3 and 4) to be replaced by NA, I had a long list of, say, 100 different values? Is there a way to specify multiple values without having to write a clumsy series of conditions separated by the | operator?
Peutch

7 Answers


You can also do this using replace:

sel <- grepl("var", names(df))
df[sel] <- lapply(df[sel], function(x) replace(x, x %in% 3:4, NA))
df

#  name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5

Some quick benchmarking using a million row sample of data suggests this is quicker than the other answers.
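
For anyone wanting to reproduce that comparison, here is a rough sketch of such a benchmark (the sample data below is illustrative, not the exact data used):

n <- 1e6
big <- data.frame(name = sample(letters[1:3], n, TRUE),
                  foo  = sample(9, n, TRUE),
                  var1 = sample(9, n, TRUE),
                  var2 = sample(5, n, TRUE))
sel <- grepl("var", names(big))
system.time(
  big[sel] <- lapply(big[sel], function(x) replace(x, x %in% 3:4, NA))
)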

thelatemail

You could also do:

col_idx <- grep("^var", names(df))
values <- c(3, 4)
m1 <- as.matrix(df[, col_idx])
m1[m1 %in% values] <- NA
df[col_idx] <- m1
df
#   name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5
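
This works because a matrix is just a vector with a dim attribute, so a single %in% call tests every cell at once. A small illustration:

m <- matrix(1:6, nrow = 2)
m %in% c(3, 4)
# [1] FALSE FALSE  TRUE  TRUE FALSE FALSE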
akrun
  • Thank you. With my data, this solution turns out to be 6 or 7 times faster than the `sapply` method. – Peutch Sep 10 '14 at 15:33
  • @Peutch - I think I've squeezed a fraction more speed out of this with `replace` - could you test on your actual data? – thelatemail Sep 11 '14 at 05:01

Since dplyr 1.0.0 (early 2020), I believe the dplyr approach would be:

library(dplyr)
df %>% mutate(across(starts_with('var'), ~replace(., . %in% c(3,4), NA)))

  name foo var1 var2
1    a   1    1   NA
2    a   2    2   NA
3    a   3   NA   NA
4    b   4   NA   NA
5    b   5    5   NA
6    b   6    6   NA
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

An alternative approach uses the naniar package, which neatly replaces values with NA in selected columns via a predicate function (here with str_detect()):

library(dplyr)
library(stringr)
library(naniar)

df %>% replace_with_na_if(str_detect(names(.), '^var'), ~ . %in% c(3, 4))

It would be very nice to see the naniar package updated to work with the current tidyselect syntax via across() and its selection helpers, with something like: df %>% mutate(across(starts_with('var'), replace_with_na_all(condition = ~ . %in% c(3, 4))))

GuedesBF

Here's an approach:

# the values that should be replaced by NA
values <- c(3, 4)

# index of columns
col_idx <- grep("^var", names(df))
# [1] 3 4

# index of values (within these columns)
val_idx <- sapply(df[col_idx], "%in%", table = values)
#        var1  var2
#  [1,] FALSE  TRUE
#  [2,] FALSE  TRUE
#  [3,]  TRUE  TRUE
#  [4,]  TRUE  TRUE
#  [5,] FALSE  TRUE
#  [6,] FALSE  TRUE
#  [7,] FALSE FALSE
#  [8,] FALSE FALSE
#  [9,] FALSE FALSE

# replace with NA
is.na(df[col_idx]) <- val_idx

df
#   name foo var1 var2
# 1    a   1    1   NA
# 2    a   2    2   NA
# 3    a   3   NA   NA
# 4    b   4   NA   NA
# 5    b   5    5   NA
# 6    b   6    6   NA
# 7    c   7    7    5
# 8    c   8    8    5
# 9    c   9    9    5
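
The last step uses the replacement form is.na(x) <- value, which sets the indexed elements to NA (a logical matrix on the right-hand side marks every TRUE cell). A quick vector example:

x <- 1:5
is.na(x) <- c(2, 4)
x
# [1]  1 NA  3 NA  5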
Sven Hohenstein

I haven't timed this option, but I have written a function called makemeNA that is part of my GitHub-only "SOfun" package.

With that function, the approach would be something like this:

library(SOfun)

Cols <- grep("^var", names(df))
df[Cols] <- makemeNA(df[Cols], NAStrings = as.character(c(3, 4)))
df
#   name foo var1 var2
# 1    a   1    1   NA
# 2    a   2    2   NA
# 3    a   3   NA   NA
# 4    b   4   NA   NA
# 5    b   5    5   NA
# 6    b   6    6   NA
# 7    c   7    7    5
# 8    c   8    8    5
# 9    c   9    9    5

The function uses the na.strings argument in type.convert to do the conversion to NA.
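
The underlying idea can be seen directly with type.convert: values listed in na.strings become NA before the column type is inferred:

type.convert(as.character(c(1, 2, 3, 4)), na.strings = c("3", "4"), as.is = TRUE)
# [1]  1  2 NA NA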


Install the package with:

library(devtools)
install_github("mrdwab/SOfun")

(or your favorite method of installing packages from GitHub).


Here's some benchmarking. I've decided to make things interesting and replace both numeric and non-numeric values with NA to see how things compare.

Here's the sample data:

n <- 1000000
set.seed(1)
df <- data.frame(
  name1 = sample(letters[1:3], n, TRUE), 
  name2 = sample(letters[1:3], n, TRUE),
  name3 = sample(letters[1:3], n, TRUE),
  var1 = sample(9, n, TRUE), 
  var2 = sample(5, n, TRUE),
  var3 = sample(9, n, TRUE))

Here are the functions to test:

fun1 <- function() {
  Cols <- names(df)
  df[Cols] <- makemeNA(df[Cols], NAStrings = as.character(c(3, 4, "a")))
  df
}

fun2 <- function() {
  values <- c(3, 4, "a")
  col_idx <- names(df)
  m1 <- as.matrix(df)
  m1[m1 %in% values] <- NA
  df[col_idx]  <- m1
  df
}

fun3 <- function() {
  values <- c(3, 4, "a")
  col_idx <- names(df)
  val_idx <- sapply(df[col_idx], "%in%", table = values)
  is.na(df[col_idx]) <- val_idx
  df
}

fun4 <- function() {
  sel <- names(df)
  df[sel] <- lapply(df[sel], function(x) 
    replace(x, x %in% c(3, 4, "a"), NA))
  df
}

I'm breaking out fun2 and fun3. I'm not crazy about fun2 because it converts everything to the same type. I also expect fun3 to be slower.
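
(The type conversion in fun2 happens because as.matrix on a data frame with mixed column types coerces everything to character:)

as.matrix(data.frame(a = 1:2, b = c("x", "y")))
#      a   b
# [1,] "1" "x"
# [2,] "2" "y"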

system.time(fun2())
#    user  system elapsed 
#    4.45    0.33    4.81 

system.time(fun3())
#    user  system elapsed 
#   34.31    0.38   34.74 

So now it comes down to me and Thela...

library(microbenchmark)
microbenchmark(fun1(), fun4(), times = 50)
# Unit: seconds
#    expr      min       lq   median       uq      max neval
#  fun1() 2.934278 2.982292 3.070784 3.091579 3.617902    50
#  fun4() 2.839901 2.964274 2.981248 3.128327 3.930542    50

Dang you Thela!

A5C1D2H2I1M1N2O1R2T1

I think dplyr is very well-suited for this task.
Using replace() as suggested by @thelatemail, you could do something like this:

library("dplyr")
df <- df %>% 
  mutate_at(vars(starts_with("var")),
            funs(replace(., . %in% c(3, 4), NA)))

df
#   name foo var1 var2
# 1    a   1    1   NA
# 2    a   2    2   NA
# 3    a   3   NA   NA
# 4    b   4   NA   NA
# 5    b   5    5   NA
# 6    b   6    6   NA
# 7    c   7    7    5
# 8    c   8    8    5
# 9    c   9    9    5
statmerkur

Here is a dplyr solution:

# Define replace function
repl.f <- function(x) ifelse(x %in% c(3, 4), NA, x)

library(dplyr)
cbind(select(df, -starts_with("var")),
  mutate_each(select(df, starts_with("var")), funs(repl.f)))

  name foo var1 var2
1    a   1    1   NA
2    a   2    2   NA
3    a   3   NA   NA
4    b   4   NA   NA
5    b   5    5   NA
6    b   6    6   NA
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5
Tomiris
  • I don't think using `mutate_each()` or its up-to-date equivalent `mutate_all()` in this way makes sense (anymore). I'm not sure if this was possible in 2015, but nowadays you should use `mutate_at(vars(starts_with("var")), ...)`, which is both more elegant and faster than the `mutate_each()`-`select()`-`cbind()` approach. – statmerkur Feb 06 '19 at 20:38
  • Agree with statmerkur; try instead using the updated dplyr language, mutate(across(...)), as has been noted elsewhere. – glenn_in_boston Nov 05 '21 at 13:10