
In a recent project, I have quite a big data frame, and I'd like to recode certain variables using a vector of column names that I defined earlier.

I know there are many other ways to recode the data, but I was wondering if I could use the vector because it seems like an elegant solution.

df <- data.frame(
  A = c(1,2,2,1),
  B = c(1,1,1,2),
  C = c(2,2,1,2)
)


vector <- c(
  "A",
  "B"
)

Consider this example. Here I have created a vector that contains two of the column names in the data frame. Can I now use this vector to recode the data frame? E.g. I'd like to change all '1' to '0' in the columns 'A' and 'B'.

I tried this:

df[df[,vector]==1] <- 0

Yet this code only works when I define the vector like this:

vector <- c(
  "A",
  "B",
  "C"
)

That is, it only works when the vector includes all the variables in the data frame.

If I use the same code but the vector only includes 'A' and 'B', I get the following error:

Error in `[<-.data.frame`(`*tmp*`, df[, vector] == 2, value = 1) : 
  unsupported matrix index in replacement

Do you have an idea of how this might work?

Kind regards

Linus
  • `df[, vector] <- replace(df[, vector], df[, vector] == 1, 0)` – Maël Feb 27 '23 at 13:15
  • That worked, thanks! Do you think it is also possible to use a vector to change the class of those columns? Like so: `ds[,varnames] <- as.numeric(ds[,varnames])`. That didn't work for me, though... – Linus Feb 27 '23 at 14:03
  • Never mind, I figured it out: `ds[,varnames] <- sapply(ds[,varnames],as.numeric)` – Linus Feb 27 '23 at 14:20
  • Have a look at [Replace all NA with FALSE in selected columns in R](https://stackoverflow.com/q/7279089/10488504) – GKi Feb 28 '23 at 08:18
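
On the class-conversion follow-up in the comments: a minimal sketch of the same idea using lapply() instead of sapply(), which keeps the result as a list of columns even when only one column or one row is involved. The data frame ds and the vector varnames below are made-up stand-ins for the ones mentioned in the comment, not the asker's actual data.

```r
# Hypothetical example data: the columns are stored as character strings
ds <- data.frame(
  x = c("1", "2", "3"),
  y = c("4", "5", "6"),
  z = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Names of the columns whose class should change (assumed for illustration)
varnames <- c("x", "y")

# lapply() returns one converted vector per selected column,
# and ds[varnames] <- ... writes each one back to its own column
ds[varnames] <- lapply(ds[varnames], as.numeric)

sapply(ds, class)
```

Columns not named in varnames (here z) are left untouched.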

2 Answers


You can use mutate(across()) from dplyr.

library(dplyr)
mutate(df, across(all_of(vector), \(v) replace(v, v == 1, 0)))
langtang

A base way could be to subset df with vector and then assign 0 to the elements of that subset where df[vector]==1.

df[,vector][df[,vector]==1] <- 0
#df[vector][df[vector]==1] <- 0 #Alternative

df
#  A B C
#1 0 0 2
#2 2 0 2
#3 2 0 1
#4 0 2 2
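
As a rough sketch of why the attempt in the question fails while the subset-first form works: the comparison on the selected columns yields a logical matrix with one column per selected variable, and a logical matrix index on a data frame has to match the data frame's dimensions. The names mask and sub below are only illustrative.

```r
df <- data.frame(
  A = c(1,2,2,1),
  B = c(1,1,1,2),
  C = c(2,2,1,2)
)
vector <- c("A", "B")

# Logical matrix built from the two selected columns only
mask <- df[, vector] == 1

dim(mask)  # 4 rows, 2 columns
dim(df)    # 4 rows, 3 columns, so df[mask] <- 0 cannot line up

# The mask does line up with the subset it was computed from
sub <- df[, vector]
sub[mask] <- 0

# Write the recoded subset back into the original data frame
df[, vector] <- sub
df
```

When vector names every column, the mask has the same dimensions as df, which is why the original code only worked in that case.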

Another way could be to use a for loop.

for(i in vector) df[[i]][df[[i]]==1] <- 0
#for(i in vector) df[,i][df[,i]==1] <- 0 #Variant
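
If the loop is needed in several places, it could be wrapped in a small helper. This is only a sketch: recode_cols and its arguments are hypothetical names, not part of base R or any package.

```r
# Hypothetical helper: replace `from` with `to` in the named columns only
recode_cols <- function(data, cols, from, to) {
  for (col in cols) {
    # Rows where the column equals `from`; NA values are treated as no match
    hit <- !is.na(data[[col]]) & data[[col]] == from
    data[[col]][hit] <- to
  }
  data
}

df <- data.frame(
  A = c(1,2,2,1),
  B = c(1,1,1,2),
  C = c(2,2,1,2)
)

# Returns a modified copy, so the result has to be assigned
df <- recode_cols(df, c("A", "B"), from = 1, to = 0)
df
```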

Benchmark

bench::mark(check=FALSE,
langtang = local({df <- dplyr::mutate(df,dplyr::across(all_of(vector),\(v) replace(v,v==1,0)))}),
"Maël" = local({df[, vector] <- replace(df[, vector], df[, vector] == 1, 0)}),
GKi = local({df[,vector][df[,vector]==1] <- 0}),
GKi2 = local(for(i in vector) df[,i][df[,i]==1] <- 0),
GKi3 = local(for(i in vector) df[[i]][df[[i]]==1] <- 0)
)
#  expression      min median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t> <list>
#1 langtang     2.66ms    3ms    299.   7.89KB    8.37   143     4   478ms <NULL>
#2 Maël       219.56µs  241µs   4017.     280B   12.3   1955     6   487ms <NULL>
#3 GKi        222.48µs  243µs   4013.     280B   12.3   1951     6   486ms <NULL>
#4 GKi2       106.96µs  116µs   8452.     280B   12.3   4119     6   487ms <NULL>
#5 GKi3        60.75µs   65µs  15217.     280B   14.4   7398     7   486ms <NULL>

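On this 4-row example, the [[ for loop (GKi3) has a median time roughly 4 times lower than the vectorised base variants (Maël, GKi) and about 45 times lower than the dplyr variant. The [,i] loop (GKi2) is about 2 times faster than the vectorised base variants. All base variants allocate less memory than the dplyr variant (280B versus 7.89KB).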

GKi