How to replace factor levels in multiples columns of a data frame based on the match lookup data frame using R

Question

The levels in df1 that matches the lab_pt in the dataframe lookup_df I would like to replace with the corresponding leves in the second column of the lookup_df (this here: lab_en). But I want to keep the rest as it is. Thanks a lot!

--

Main dataframe

df1 <- data.frame(
            num_var = sample(200, 15),
            col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
            col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
            col3 = rep(c("surf","onda","dente","onda","sei"), 3),
            col3 = rep(c("onda","casa",NA,"nao","net"), 3))

Lookeup data frame

lookup_df <- data.frame(
            lab_pt = c("onda","estrela","rato","caneta","ceu"),
            lab_en = c("wave","star","rat","pen","sky"))

I have tried this here below . It does the job, but the non matching information is transformed to NAs and this I don't want.

rownames(lookup_df) <- lookup_df$lab_pt
apply(df1[,2:ncol(df1)], 2, function(x) lookup_df[as.character(x),]$lab_en)

This post here is quite similar, but in that case all the levels are matchable, different from this here. Thanks a lot! Replace values in a dataframe based on lookup table

score 1 · Answer 1 · answered Dec 14 '18 at 12:07

I think this should do it, with data.table package. It does re-order the id's, is this a problem?

# added seed
# changed col3 to col4
set.seed(1)
df1 <- data.frame(
  num_var = sample(200, 15),
  col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
  col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
  col3 = rep(c("surf","onda","dente","onda","sei"), 3),
  col4 = rep(c("onda","casa",NA,"nao","net"), 3))

lookup_df <- data.frame(
  lab_pt = c("onda","estrela","rato","caneta","ceu"),
  lab_en = c("wave","star","rat","pen","sky"))

# data.table solution
library(data.table)

# change from wide to long, to make merge easier
dt <- melt(as.data.table(df1), id.vars="num_var")

# merge in the new values to original data
dt2 <- merge(dt, lookup_df, by.x="value", by.y="lab_pt",
             all.x=TRUE)

# if its missing, replace with original value
dt2[is.na(lab_en), lab_en := value]

# convert back from long to wide
dt3 <- dcast(dt2[, .(num_var, variable, lab_en)], num_var~variable,
            value.var="lab_en")

# back to data.frame
output <- as.data.frame(dt3)

Whenever you are doing merges between tables, its usually nicer to work with long format data, where you have a group column and a value column. It means you don't need to run the same operation multiple times (the merge).

Thanks a lot! Your approach was very good. I will use that! I would just add this here (see below) to sort the dataframe to the original form. `output[ order(match(output$num_var, df1$num_var)), ]` — Jai_surf_code, Dec 14 '18 at 17:49

score 1 · Answer 2 · answered Dec 14 '18 at 12:11

1

I think this might help you it will although create a new column but will do the work

df1$new <- lookup_df[match(df1$col1, lookup_df$lab_pt),2]

answered Dec 14 '18 at 12:11

Hunaidkhan

1,411
2
11
21

Okay. Thanks! This is a nice job indeed, but replace only one column. Maybe I will use your approach in a loop over all the columns in a dataframe. – Jai_surf_code Dec 14 '18 at 17:19

s_baldur · Answer 3 · 2018-12-14T13:04:33.583

You could do something like the following:

lookup_vec <- setNames(as.character(lookup_df[["lab_en"]]), lookup_df[["lab_pt"]])
#   onda estrela    rato  caneta     ceu 
# "wave"  "star"   "rat"   "pen"   "sky" 
factors_vars <- names(df1)[sapply(df1, is.factor)]
for (var in factors_vars) {
  w <- which(levels(df1[[var]]) %in% names(lookup_vec)) # Get only those that are "matchable"
  levels(df1[[var]])[w] <- lookup_vec[levels(df1[[var]])[w]]
}
df1

   num_var col1 col2  col3 col3.1
1       21 wave muro  surf   wave
2      104 star gato  wave   casa
3       60  rat   pa dente   <NA>
4      183  pen  rat  wave    nao
5      123  sky  sky   sei    net
6       17 wave muro  surf   wave
7       34 star gato  wave   casa
8      126  rat   pa dente   <NA>
9      139  pen  rat  wave    nao
10      35  sky  sky   sei    net
11     149 wave muro  surf   wave
12       8 star gato  wave   casa
13      46  rat   pa dente   <NA>
14      32  pen  rat  wave    nao
15     162  sky  sky   sei    net

Man! Thanks a lot! I put your very simple and effective thoughts in a function and it is **working fantastically!** Actually this function should be further developed and be included in a proper package for data manipulation! Very useful. — Jai_surf_code, Dec 15 '18 at 18:39

otwtm · Answer 4 · 2018-12-15T11:27:33.787

Here is a solution using the dplyr package. Notice the argument stringAsFactor=F to keep the words as strings.

   df1 <- data.frame(
      num_var = sample(200, 15),
      col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
      col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
      col3 = rep(c("surf","onda","dente","onda","sei"), 3),
      col3 = rep(c("onda","casa",NA,"nao","net"), 3), stringsAsFactors = F)

    lookup_df <- data.frame(
      lab_pt = c("onda","estrela","rato","caneta","ceu"),
      lab_en = c("wave","star","rat","pen","sky"), stringsAsFactors = F)


    library(dplyr)

    df1 %>% mutate(col1=replace(col1, col1 %in% lookup_df$lab_pt, lookup_df$lab_en)) %>% 
      mutate(col2=replace(col2, col2 %in% lookup_df$lab_pt, lookup_df$lab_en)) %>% 
      mutate(col3=replace(col3, col3 %in% lookup_df$lab_pt, lookup_df$lab_en)) %>%
      mutate(col3.1=replace(col3.1, col3.1 %in% lookup_df$lab_pt, lookup_df$lab_en))

I admit it is a bit tedious to use one line for each column of the dataframe. Couldn't find a way to do it for all columns at once.

   num_var col1 col2  col3 col3.1
1        6 wave muro  surf   wave
2       84 star gato  wave   casa
3      146  rat   pa dente   <NA>
4      133  pen wave  star    nao
5       47  sky star   sei    net
6      116 wave muro  surf   star
7       81 star gato   rat   casa
8      118  rat   pa dente   <NA>
9      186  pen  rat   pen    nao
10     161  sky  pen   sei    net
11     135 wave muro  surf    rat
12      31 star gato   sky   casa
13     174  rat   pa dente   <NA>
14     187  pen  sky  wave    nao
15     178  sky wave   sei    net

score 1 · Accepted Answer · answered Aug 14 '20 at 18:09

# Fake dataframe
df1 <- tibble(
        num_var = sample(200, 15),
        col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
        col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
        col3 = rep(c("surf","onda","dente","onda","sei"), 3),
        col4 = rep(c("onda","casa",NA,"nao","net"), 3))

# Lookup dictionary dataframe
lookup_dat <- tibble(
        lab_pt = c("onda","estrela","rato","caneta","ceu"),
        lab_en = c("wave","star","rat","pen","sky")) 

#******************************************************************
#
# Translation by replacement of lookup dictionary 
# Developed to generate Rmd report with labels of plots in different languages
replace_level <- function(df, lookup_df, col_langu_in, col_langu_out){
        library(data.table)
        # function to replace levels in the df given a reference list in 
        # another df when level match it replace with the correspondent 
        #level in the same row name but in other column.
        # !!!! Variables col_langu need to be quoted 
           # 1) Below it creates a dictionary style with the reference df (2cols)
         lookup_vec <- setNames(as.character(lookup_df[[col_langu_out]]), 
                               lookup_df[[col_langu_in]])
           # 2) iterating over main df col names
         for (i in names(df)) { # select cols?: names(df)[sapply(df, is.factor)]
           # 3) return index of levels from df levels matching with those from 
                 # the dictionary type to replace (for each cols of df i)
                 if(is.character(df[[i]])){df[i] <- as.factor(df[[i]])}
                 # Changing from character to factor before the translation
                 index_match <- which(levels(df[[i]]) %in% 
                                              names(lookup_vec))
           # 4) replacing matchable levels based on the index on step 3).
                 # with the reference to translate
                 levels(df[[i]])[index_match] <- 
                         lookup_vec[levels(df[[i]])[index_match]]}
         return(df)}

# test here
replace_level(df1, lookup_dat, "lab_pt", "lab_en")

Now it is ready to be used. It may interest someone. It is working for me ;-). — Jai_surf_code, Aug 14 '20 at 18:16

How to replace factor levels in multiples columns of a data frame based on the match lookup data frame using R

5 Answers5