I have a data set with many columns (DATA_OLD) in which I want to exchange all values based on an allocation list with many entries (KEY).

Every value in DATA_OLD should be replaced by its counterpart (can be seen in KEY) to create DATA_NEW.

For simplicity, the example here uses a short KEY and a small DATA_OLD. In reality, KEY has more than 2,500 rows and DATA_OLD has more than 100 columns. I therefore need an approach that works on the whole data set at once, without naming each column of DATA_OLD.

KEY:

old new
1 1
3 2
7 3
12 4
55 5

Following this example, every value 1 stays 1, every value 3 is replaced with 2, every value 7 is replaced with 3, and so on.

DATA_OLD (START):

var1 var2 var3
NA   3    1
NA   55   NA
1    NA   NA
NA   NA   NA
3    NA   NA
55   NA   12

DATA_NEW (RESULT):

var1 var2 var3
NA   2    1
NA   5    NA
1    NA   NA
NA   NA   NA
2    NA   NA
5    NA   4

Here is the reproducible data:

KEY<-structure(list(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 
5)), class = "data.frame", row.names = c(NA, -5L))

DATA_OLD<-structure(list(var1 = c(NA, NA, 1, NA, 3, 55), var2 = c(3, 
55, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 12)), class = "data.frame", row.names = c(NA, -6L))

DATA_NEW<-structure(list(var1 = c(NA, NA, 1, NA, 2, 5), var2 = c(2, 
5, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 4)), class = "data.frame", row.names = c(NA, -6L))

I have tried back and forth and have not found a working approach. Help would be greatly appreciated! The real data set is quite large...

  • There are many different solutions here: [Canonical tidyverse method to update some values of a vector from a look-up table](https://stackoverflow.com/questions/67081496/canonical-tidyverse-method-to-update-some-values-of-a-vector-from-a-look-up-tabl). Does this answer your question? – jared_mamrot Nov 28 '22 at 11:54
  • Does this answer your question? [Replace values in data frame based on other data frame in R](https://stackoverflow.com/questions/15069984/replace-values-in-data-frame-based-on-other-data-frame-in-r) – arg0naut91 Nov 28 '22 at 11:56
  • @arg0naut91: Using "match" works in general for the exchange, as long as I name each column of the data frame. Would you know how I can use match simultaneously on all columns of my df? DATA_OLD is a simplification. The original data set is very large. – Julian ter Horst Nov 28 '22 at 12:47
  • Then I'd suggest you modify your example so that it contains "more columns", as I'm not sure what exactly you are referring to - keys or values (or both) etc. – arg0naut91 Nov 28 '22 at 13:01
  • @jared_mamrot: Thank you for your comment. The solution you offer uses the package data.table and also works very well for individual columns. I am struggling to apply that approach to all columns of my large data set simultaneously. – Julian ter Horst Nov 28 '22 at 13:08
  • @arg0naut91: I have edited the description, to highlight the size of the data set! I hope it is clearer now. Thank you for your comment. – Julian ter Horst Nov 28 '22 at 13:16
  • You can check some of the answers here: https://stackoverflow.com/questions/65227663/how-to-match-and-replace-value-from-one-dataframe-to-another – arg0naut91 Nov 28 '22 at 13:25

2 Answers

1) Base R. Be careful here, since some solutions have the side effect of converting the numeric columns to character or factor, or of turning the data frame into something else. A solution using match will generally work. The result of lapply is a list, so convert it back to a data frame.

DATA_OLD |>
  lapply(function(x) with(KEY, new[match(x, old)])) |>
  as.data.frame()

or

DATA_NEW <- DATA_OLD
DATA_NEW[] <- lapply(DATA_OLD, function(x) with(KEY, new[match(x, old)]))

This last one is easy to adapt so that it acts only on some columns:

DATA_NEW <- DATA_OLD
ix <- 1:2 # only convert these columns
DATA_NEW[ix] <- lapply(DATA_OLD[ix], function(x) with(KEY, new[match(x, old)]))
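
One thing to keep in mind with the match-based replacement is that any value with no entry in KEY$old comes back as NA. If such values should be kept unchanged instead, a minimal sketch could look like the following. The helper name recode_keep and the data frame DATA_X are hypothetical and exist only for this illustration; they are not part of the question.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))

# Hypothetical data containing values (99, 100) that do not appear in KEY$old
DATA_X <- data.frame(a = c(3, 99, NA), b = c(55, 7, 100))

# Hypothetical helper: look up x in key$old; where no match is found,
# keep the original value instead of returning NA
recode_keep <- function(x, key) {
  idx <- match(x, key$old)
  ifelse(is.na(idx), x, key$new[idx])
}

DATA_Y <- DATA_X
DATA_Y[] <- lapply(DATA_X, recode_keep, key = KEY)  # [] keeps the data frame structure
```

Here 99 and 100 pass through untouched, while 3, 55 and 7 are recoded and the original NA stays NA.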

2) purrr. Alternatively, use map_dfr, which returns a data frame directly:

library(purrr)
map_dfr(DATA_OLD, ~ with(KEY, new[match(.x, old)]))

3) dplyr. A dplyr solution using across is shown below. If there are non-numeric columns that should not be converted, replace everything() with where(is.numeric).

library(dplyr)
DATA_OLD %>%
  mutate(across(everything(), ~ with(KEY, new[match(.x, old)])))
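
For comparison, the same "numeric columns only" restriction can be sketched in base R by computing the ix selector from (1) with is.numeric. This is a base R analogue, not the dplyr code itself, and the data frame DATA_MIXED with its id column is a hypothetical example that does not come from the question.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))

# Hypothetical mixed-type data: `id` is character and must stay untouched
DATA_MIXED <- data.frame(
  id   = c("a", "b", "c"),
  var1 = c(1, 3, NA),
  var2 = c(55, NA, 12),
  stringsAsFactors = FALSE
)

# Logical selector: TRUE for numeric columns only
ix <- vapply(DATA_MIXED, is.numeric, logical(1))

DATA_MIXED_NEW <- DATA_MIXED
DATA_MIXED_NEW[ix] <- lapply(DATA_MIXED[ix],
                             function(x) with(KEY, new[match(x, old)]))
```

The id column is left as it was, and only var1 and var2 go through the lookup.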
G. Grothendieck

The simplest way to implement a dictionary in R is a named vector, where you can use the names as indices:

key <- setNames(KEY$new, KEY$old)
> key
 1  3  7 12 55 
 1  2  3  4  5 

The only thing to be mindful of is that the indexing must be done by character, not by integer:

> key[3]
7 
3  # WRONG! This is the 3rd item!
> key["3"]
3 
2  # RIGHT! This is the item named "3"

Then you can apply the transformation column-wise. apply returns a matrix, but you can simply convert it back to a data frame with as.data.frame.

as.data.frame(apply(DATA_OLD, 2, \(col) key[as.character(col)]))
  var1 var2 var3
1   NA    2    1
2   NA    5   NA
3    1   NA   NA
4   NA   NA   NA
5    2   NA   NA
6    5   NA    4
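
If the detour through a matrix is not wanted, the same named-vector lookup can be combined with lapply, so that the result stays a data frame throughout. The following is a minimal sketch under that assumption; the names by_name and by_match are hypothetical, and unname is used only to drop the lookup names from the resulting columns.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))
DATA_OLD <- data.frame(
  var1 = c(NA, NA, 1, NA, 3, 55),
  var2 = c(3, 55, NA, NA, NA, NA),
  var3 = c(1, NA, NA, NA, NA, 12)
)

# Named vector used as a dictionary: names are the old values, elements the new ones
key <- setNames(KEY$new, KEY$old)

# Look up every column by character name; [] keeps the data frame structure
by_name <- DATA_OLD
by_name[] <- lapply(DATA_OLD, function(col) unname(key[as.character(col)]))

# The match-based replacement, for comparison
by_match <- DATA_OLD
by_match[] <- lapply(DATA_OLD, function(x) with(KEY, new[match(x, old)]))

identical(by_name, by_match)
```

On this example data both variants give the same data frame with numeric columns.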
Ottie