2

Suppose that I have the following data:

df <- structure(list(car_model = c(301, 302, 303, 304), colour = c(501, 
502, 503, 504), sales = c(182, 191, 302, 101)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

I also have lookup tables that supply the text to replace the codes in the columns car_model and colour:

tbl1 <- structure(list(txt = c("A", "B", "C", "Y"), cod = c(301, 302, 
303, 304), var = c("car_model", "car_model", "car_model", "car_model"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
tbl2 <- structure(list(txt = c("black", "green", "red", "white"), cod = c(501, 
502, 503, 504), var = c("colour", "colour", "colour", "colour"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

Combining the two tables into a single lookup table, I have:

tbl <- rbind(tbl1,tbl2)
# A tibble: 8 x 3
  txt     cod var      
  <chr> <dbl> <chr>    
1 A       301 car_model
2 B       302 car_model
3 C       303 car_model
4 Y       304 car_model
5 black   501 colour   
6 green   502 colour   
7 red     503 colour   
8 white   504 colour   

Is there a way to replace the codes in all of these columns of the main df using the single lookup table (matching each column name against the var column and each value against the cod column), or do I need to make separate tables, one for each variable? I would also like to know whether this is reasonable for a dataset with ~10 million rows, 30 or more variables, and a lookup table with ~5 thousand rows in total.

EDIT: The same code can appear in different variables.

EDIT2: I'm looking for a fast and memory-efficient solution, perhaps one using data.table.

R. Cowboy
  • Are the `cod`s unique? What I want to know: is `301` always a car_model and `501` always a colour, or is it possible that there is a car_model coded `501`? – Martin Gal Jun 12 '21 at 21:43
  • @MartinGal No, the codes are not unique across variables: a `car_model` can have the same code as a `colour`. The codes are unique only within each variable. – R. Cowboy Jun 12 '21 at 21:46

3 Answers

3

A data.table option

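library(data.table)
# Melt the coded columns to long form, join to the lookup on (var, cod), then unstack to wide form
cbind(unstack(setDT(tbl)[melt(
  setDT(df)[, .(car_model, colour)],
  measure.vars = c("car_model", "colour"),
  variable.name = "var",
  value.name = "cod"
), .(txt, var), on = .(var, cod)]), df[, .(sales)])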

gives

  car_model colour sales
1         A  black   182
2         B  green   191
3         C    red   302
4         Y  white   101
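
The one-liner above packs three steps into a single expression. As a rough sketch of what each step produces, the same pipeline can be split up as below. The intermediate names `long`, `joined`, and `wide` are introduced here for illustration only, and the data is rebuilt inline so the block runs on its own.

```r
library(data.table)

df <- data.table(car_model = c(301, 302, 303, 304),
                 colour    = c(501, 502, 503, 504),
                 sales     = c(182, 191, 302, 101))
tbl <- data.table(txt = c("A", "B", "C", "Y", "black", "green", "red", "white"),
                  cod = c(301, 302, 303, 304, 501, 502, 503, 504),
                  var = rep(c("car_model", "colour"), each = 4))

# Step 1: long form, one row per (column name, code) pair
long <- melt(df[, .(car_model, colour)],
             measure.vars = c("car_model", "colour"),
             variable.name = "var", value.name = "cod",
             variable.factor = FALSE)

# Step 2: look up each (var, cod) pair; rows follow the order of `long`
joined <- tbl[long, .(txt, var), on = .(var, cod)]

# Step 3: back to wide form, one column per variable
wide <- unstack(joined, txt ~ var)

result <- cbind(wide, df[, .(sales)])
```

Note that the melt step creates a long table with (number of rows) x (number of coded columns) rows, which may matter for memory with ~10 million rows and 30 columns.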
ThomasIsCoding
2

Here is one way with tidyverse:

  1. Loop across the columns found in the unique values from 'var' column of 'tbl'
  2. Get the column name of looped column with cur_column() to create a logical expression on the 'var' column of 'tbl' ('i1')
  3. Use match to get the position index where the column values match with subset of 'cod' column of 'tbl'
  4. Extract the corresponding 'txt' column of 'tbl' from the subset based on 'i1'
library(dplyr)
df <- df %>% 
    mutate(across(all_of(unique(tbl$var)),
     ~ {i1 <- tbl$var == cur_column()
       tbl$txt[i1][match(., tbl$cod[i1])]}))

-output

df
# A tibble: 4 x 3
  car_model colour sales
  <chr>     <chr>  <dbl>
1 A         black    182
2 B         green    191
3 C         red      302
4 Y         white    101
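
To see why the lookup is first restricted to the current variable, here is a small sketch of the `match` step on its own. The table `tbl_demo` is hypothetical: it adds an invented `"pink"` row that reuses code 301 for a colour, to mimic the overlapping codes mentioned in the question.

```r
# Hypothetical lookup: code 301 is used by both a car_model and a colour
tbl_demo <- data.frame(
  txt = c("A", "B", "C", "Y", "black", "pink"),
  cod = c(301, 302, 303, 304, 501, 301),
  var = c(rep("car_model", 4), "colour", "colour"),
  stringsAsFactors = FALSE
)

codes <- c(303, 301, 999)            # 999 has no entry in the lookup

i1 <- tbl_demo$var == "car_model"    # rows that belong to this variable
res_model <- tbl_demo$txt[i1][match(codes, tbl_demo$cod[i1])]

i2 <- tbl_demo$var == "colour"
res_colour <- tbl_demo$txt[i2][match(codes, tbl_demo$cod[i2])]

res_model    # "C" "A" NA
res_colour   # NA "pink" NA
```

The same code 301 resolves to a different text depending on the variable, and codes missing from the lookup become `NA` rather than raising an error.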

Or with data.table, we may use the same method

  1. Create a named vector from 'tbl' ('nm1')
  2. Convert the 'data.frame' to 'data.table' (setDT)
  3. Specify the columns of interest in .SDcols from the unique element of 'var'
  4. Do the match by looping with Map and assign (:=) the output back to the original columns
library(data.table)
nm1 <- setNames(tbl$txt, tbl$cod)
un1 <- unique(tbl$var)
setDT(df)[, (un1) := Map(function(x, y) 
     nm1[tbl$var == y][as.character(x)], .SD,  un1), .SDcols = un1]

-output

df
   car_model colour sales
1:         A  black   182
2:         B  green   191
3:         C    red   302
4:         Y  white   101
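
For a large table, one possible variation is to replace the columns one at a time with `data.table::set()`, which assigns by reference. This is only a sketch under the assumption that every column named in `var` exists in `df`; it has not been benchmarked against the approaches above, and the helper name `lk` is introduced here for illustration.

```r
library(data.table)

df <- data.table(car_model = c(301, 302, 303, 304),
                 colour    = c(501, 502, 503, 504),
                 sales     = c(182, 191, 302, 101))
tbl <- data.table(txt = c("A", "B", "C", "Y", "black", "green", "red", "white"),
                  cod = c(301, 302, 303, 304, 501, 502, 503, 504),
                  var = rep(c("car_model", "colour"), each = 4))

for (v in unique(tbl$var)) {
  lk <- tbl[var == v]                       # lookup rows for this column only
  # Replace the whole column in place; codes without a match become NA
  set(df, j = v, value = lk$txt[match(df[[v]], lk$cod)])
}
```

Only one column's worth of replacement values is materialised per iteration, and the column order of `df` is preserved.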

Or we may use base R (here df is assumed to be a plain data.frame or tibble holding the original codes)

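un1 <- unique(tbl$var)
lst1 <- with(tbl, split(setNames(txt, cod), var))
# Index lst1 by un1 so each column is paired with its own lookup vector
df[un1] <- Map(function(x, y) y[as.character(x)], df[un1], lst1[un1])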
akrun
  • Thanks akrun, is there a way to do it with `data.table` ? Is dplyr more memory efficient and fast than data.table for it? – R. Cowboy Jun 12 '21 at 21:55
2

You can reshape the data and perform the join.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = -sales) %>%
  left_join(tbl, by = c('name' = 'var', 'value' = 'cod')) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = txt)

#  sales car_model colour
#  <dbl> <chr>     <chr> 
#1   182 A         black 
#2   191 B         green 
#3   302 C         red   
#4   101 Y         white 
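
One caveat with the reshape approach: `pivot_wider` identifies rows by the remaining columns (here only `sales`), so two rows with the same `sales` value would be collapsed into list-columns. A sketch of one way around this, assuming a temporary `row_id` column (a name introduced here, not part of the original data) is acceptable:

```r
library(dplyr)
library(tidyr)

# Same shape as the question's data, but with a duplicated sales value
df <- tibble(car_model = c(301, 302, 303, 304),
             colour    = c(501, 502, 503, 504),
             sales     = c(182, 182, 302, 101))
tbl <- tibble(txt = c("A", "B", "C", "Y", "black", "green", "red", "white"),
              cod = c(301, 302, 303, 304, 501, 502, 503, 504),
              var = rep(c("car_model", "colour"), each = 4))

out <- df %>%
  mutate(row_id = row_number()) %>%                  # unique key per original row
  pivot_longer(cols = all_of(unique(tbl$var))) %>%   # only the coded columns
  left_join(tbl, by = c("name" = "var", "value" = "cod")) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = txt) %>%
  select(-row_id)
```

Selecting the coded columns with `all_of(unique(tbl$var))` instead of `-sales` also keeps the code from depending on which other columns happen to be in `df`.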
Ronak Shah