Replacing strings with lookup table dplyr

Question

I am trying to create a look up table in R in order to get my data in the same format as the company that I am working for.

It regards different education categories that I want to merge using dplyr.

library(dplyr)

# Create data
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")

    data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))

    tbl_df(data)

    # Create lookup table
    lut <- c("Mechanichal Engineering" = "Engineering",
             "Electric Engineering" = "Engineering",
             "Political Science" = "Social Science",
             "Economics" = "Social Science")

    # Assign lookup table
    data$X1 <- lut[data$X1]

But in my output my old values are replace with the wrong ones, i.e. not the ones that I created in the lookup table. Rather it seems like the lookup table is assign randomly.

More generally, make sure the sample code you provide can be run by others (keep comments to after `#`, don't change capitalization in data/Data). Here's a guide http://stackoverflow.com/a/28481250/1191259 — Frank, Jul 08 '15 at 14:59
What about `Data$education <- lut[as.character(data$education)]` — MrFlick, Jul 08 '15 at 14:59
@Frank thanks. fixed. but agree that a reproducible example should be in the question — MrFlick, Jul 08 '15 at 15:00
Perhaps using a join to a map of your preferred values might even be more efficient? — leerssej, Nov 12 '16 at 08:40

score 3 · Accepted Answer · answered Jun 05 '18 at 14:25

I've just been trying to figure this out myself. I wasn't quite happy with most of the solutions I found, so here's what I ended up with. I added an "other" category to show that it works even if there are values not defined in the lookup table.

library(dplyr)

# Create data
education <- c("Mechanichal Engineering",
               "Electric Engineering",
               "Political Science",
               "Economics",
               "Other")

data <- data.frame(X1 = replicate(1, sample(education, 20, rep=TRUE)))

# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

data %>%
    mutate(X2 = recode(X1, !!!lut))
#>                         X1             X2
#> 1     Electric Engineering    Engineering
#> 2                    Other          Other
#> 3                    Other          Other
#> 4                    Other          Other
#> 5                    Other          Other
#> 6        Political Science Social Science
#> 7                    Other          Other
#> 8                Economics Social Science
#> 9        Political Science Social Science
#> 10    Electric Engineering    Engineering
#> 11               Economics Social Science
#> 12               Economics Social Science
#> 13 Mechanichal Engineering    Engineering
#> 14               Economics Social Science
#> 15       Political Science Social Science
#> 16                   Other          Other
#> 17                   Other          Other
#> 18                   Other          Other
#> 19 Mechanichal Engineering    Engineering
#> 20       Political Science Social Science

score 2 · Answer 2 · answered Nov 15 '16 at 15:45

2

education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
lut <- list("Mechanichal Engineering" = "Engineering",
            "Electric Engineering" = "Engineering",
            "Political Science" = "Social Science",
            "Economics" = "Social Science")
lut2<-melt(lut)
data1 <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
data1$new <- lut2[match(data1$X1,lut2$L1),'value']
head(data1)


=======================  ==============
X1                       new           
=======================  ==============
Political Science        Social Science
Political Science        Social Science
Mechanichal Engineering  Engineering   
Mechanichal Engineering  Engineering   
Political Science        Social Science
Political Science        Social Science
=======================  ==============

answered Nov 15 '16 at 15:45

Carl Boneri

2,632
1
13
15

The problem with this method is that if I want to recode the only one string, then my new variable will contain a number of NA's. That is a problem if I have a data set with a large amount of strings but only want to recode some of them. – FilipW Nov 17 '16 at 17:10
expand a bit, please. Are you saying in the event that you only wanted to replace Mechanical Engineering with Engineering, but ignore everything else? – Carl Boneri Nov 17 '16 at 17:22
Exactly. Consider if I have a 1000 different factors and only want to rename a few of them, or as in this case, merge different factors together. Then this solution will result in a new variable where every factor that is not explicitly renamed will be N/A. – FilipW Nov 18 '16 at 08:49

score 1 · Answer 3 · answered Sep 20 '22 at 17:06

I had a similar problem where I had a dataframe with lots of columns and one of the columns had about 95 different values. I wanted to create another column that grouped (mapped) the 95 values into something more manageable. I created a simple dataframe that had the mappings to columns as a lookup table.

In your case rather than do this step:

# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

You make a simple data frame:

subject = c("Mechanichal Engineering",
             "Electric Engineering",
             "Political Science"",
             "Economics")

category = c("Engineering",
             "Engineering",
             "Social Science",
             "Social Science")

lookup_table = data.frame(subject, category)

You need two libraries to do the hash table and lookup:

library(hash)
library(qdapTools)

Load in a lookup_table dataframe with the two columns that you want to be the hash table:

In this dataframe column 1 will be the key and column 2 is the lookup value.

# make the hash table
h = hash::hash(keys = lookup_table$subject, values = lookup_table$category)

# create the categories for your education values
# find the match of education in your h table and return the value category from the h table

data$ed_category = hash_look(data$education, h, missing = data$education)

Now your output will look like the following; using the missing parameter just uses Other since it's not in the hash table but in the education column.

#>                  education    ed_category
#> 1     Electric Engineering    Engineering
#> 2                    Other          Other
#> 3                    Other          Other
#> 4                    Other          Other
#> 5                    Other          Other
#> 6        Political Science Social Science

FilipW · Answer 4 · 2016-11-14T12:50:32.377

I found that the best way to do this is to use recode() from the car package

# Observe that dplyr also has a recode function, so require car after dplyr
    require(dplyr)
    require(car)

The data are four education categories that are sampled from.

    education <- c("Mechanichal Engineering",
                   "Electric Engineering","Political Science","Economics")

data <- data.frame(ID = c(1:1000), X1 = replicate(1,sample(education,1000,rep=TRUE)))

Using recode() on the data I recode the categories

lut <- data.frame(ID = c(1:1000), X2 = recode(data$X1, '"Economics" = "Social Science";
                         "Electric Engineering" = "Engineering";
                          "Political Science" = "Social Science";
                          "Mechanichal Engineering" = "Engineering"'))

To see if it performed correct, join the original data and the recoded data

data <- full_join(data, lut, by = "ID")

head(data)

   ID                     X1             X2
1  1       Political Science Social Science
2  2               Economics Social Science
3  3    Electric Engineering    Engineering
4  4       Political Science Social Science
5  5               Economics Social Science
6  6 Mechanichal Engineering    Engineering

With recode you don't have to sort the data before recoding it.

Replacing strings with lookup table dplyr

4 Answers4