6

I am trying to create a look up table in R in order to get my data in the same format as the company that I am working for.

It regards different education categories that I want to merge using dplyr.

library(dplyr)

# Create data
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")

    data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))

    tbl_df(data)

    # Create lookup table
    lut <- c("Mechanichal Engineering" = "Engineering",
             "Electric Engineering" = "Engineering",
             "Political Science" = "Social Science",
             "Economics" = "Social Science")

    # Assign lookup table
    data$X1 <- lut[data$X1]

But in my output my old values are replace with the wrong ones, i.e. not the ones that I created in the lookup table. Rather it seems like the lookup table is assign randomly.

FilipW
  • 1,412
  • 1
  • 13
  • 25
  • 1
    Please include `dput(data)` in your question. – nrussell Jul 08 '15 at 14:51
  • 3
    More generally, make sure the sample code you provide can be run by others (keep comments to after `#`, don't change capitalization in data/Data). Here's a guide http://stackoverflow.com/a/28481250/1191259 – Frank Jul 08 '15 at 14:59
  • 1
    What about `Data$education <- lut[as.character(data$education)]` – MrFlick Jul 08 '15 at 14:59
  • @Frank thanks. fixed. but agree that a reproducible example should be in the question – MrFlick Jul 08 '15 at 15:00
  • Perhaps using a join to a map of your preferred values might even be more efficient? – leerssej Nov 12 '16 at 08:40

4 Answers4

3

I've just been trying to figure this out myself. I wasn't quite happy with most of the solutions I found, so here's what I ended up with. I added an "other" category to show that it works even if there are values not defined in the lookup table.

library(dplyr)

# Create data
education <- c("Mechanichal Engineering",
               "Electric Engineering",
               "Political Science",
               "Economics",
               "Other")

data <- data.frame(X1 = replicate(1, sample(education, 20, rep=TRUE)))

# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

data %>%
    mutate(X2 = recode(X1, !!!lut))
#>                         X1             X2
#> 1     Electric Engineering    Engineering
#> 2                    Other          Other
#> 3                    Other          Other
#> 4                    Other          Other
#> 5                    Other          Other
#> 6        Political Science Social Science
#> 7                    Other          Other
#> 8                Economics Social Science
#> 9        Political Science Social Science
#> 10    Electric Engineering    Engineering
#> 11               Economics Social Science
#> 12               Economics Social Science
#> 13 Mechanichal Engineering    Engineering
#> 14               Economics Social Science
#> 15       Political Science Social Science
#> 16                   Other          Other
#> 17                   Other          Other
#> 18                   Other          Other
#> 19 Mechanichal Engineering    Engineering
#> 20       Political Science Social Science
Oliver
  • 1,098
  • 1
  • 11
  • 16
2
education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
lut <- list("Mechanichal Engineering" = "Engineering",
            "Electric Engineering" = "Engineering",
            "Political Science" = "Social Science",
            "Economics" = "Social Science")
lut2<-melt(lut)
data1 <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))
data1$new <- lut2[match(data1$X1,lut2$L1),'value']
head(data1)


=======================  ==============
X1                       new           
=======================  ==============
Political Science        Social Science
Political Science        Social Science
Mechanichal Engineering  Engineering   
Mechanichal Engineering  Engineering   
Political Science        Social Science
Political Science        Social Science
=======================  ==============
Carl Boneri
  • 2,632
  • 1
  • 13
  • 15
  • The problem with this method is that if I want to recode the only one string, then my new variable will contain a number of NA's. That is a problem if I have a data set with a large amount of strings but only want to recode some of them. – FilipW Nov 17 '16 at 17:10
  • expand a bit, please. Are you saying in the event that you only wanted to replace Mechanical Engineering with Engineering, but ignore everything else? – Carl Boneri Nov 17 '16 at 17:22
  • Exactly. Consider if I have a 1000 different factors and only want to rename a few of them, or as in this case, merge different factors together. Then this solution will result in a new variable where every factor that is not explicitly renamed will be N/A. – FilipW Nov 18 '16 at 08:49
1

I had a similar problem where I had a dataframe with lots of columns and one of the columns had about 95 different values. I wanted to create another column that grouped (mapped) the 95 values into something more manageable. I created a simple dataframe that had the mappings to columns as a lookup table.

In your case rather than do this step:

# Create lookup table
lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering" = "Engineering",
         "Political Science" = "Social Science",
         "Economics" = "Social Science")

You make a simple data frame:

subject = c("Mechanichal Engineering",
             "Electric Engineering",
             "Political Science"",
             "Economics")

category = c("Engineering",
             "Engineering",
             "Social Science",
             "Social Science")

lookup_table = data.frame(subject, category)

You need two libraries to do the hash table and lookup:

library(hash)
library(qdapTools)

Load in a lookup_table dataframe with the two columns that you want to be the hash table:

In this dataframe column 1 will be the key and column 2 is the lookup value.

# make the hash table
h = hash::hash(keys = lookup_table$subject, values = lookup_table$category)

# create the categories for your education values
# find the match of education in your h table and return the value category from the h table

data$ed_category = hash_look(data$education, h, missing = data$education)

Now your output will look like the following; using the missing parameter just uses Other since it's not in the hash table but in the education column.

#>                  education    ed_category
#> 1     Electric Engineering    Engineering
#> 2                    Other          Other
#> 3                    Other          Other
#> 4                    Other          Other
#> 5                    Other          Other
#> 6        Political Science Social Science
Bryan Butler
  • 1,750
  • 1
  • 19
  • 19
0

I found that the best way to do this is to use recode() from the car package

# Observe that dplyr also has a recode function, so require car after dplyr
    require(dplyr)
    require(car)

The data are four education categories that are sampled from.

    education <- c("Mechanichal Engineering",
                   "Electric Engineering","Political Science","Economics")

data <- data.frame(ID = c(1:1000), X1 = replicate(1,sample(education,1000,rep=TRUE)))

Using recode() on the data I recode the categories

lut <- data.frame(ID = c(1:1000), X2 = recode(data$X1, '"Economics" = "Social Science";
                         "Electric Engineering" = "Engineering";
                          "Political Science" = "Social Science";
                          "Mechanichal Engineering" = "Engineering"'))

To see if it performed correct, join the original data and the recoded data

data <- full_join(data, lut, by = "ID")

head(data)

   ID                     X1             X2
1  1       Political Science Social Science
2  2               Economics Social Science
3  3    Electric Engineering    Engineering
4  4       Political Science Social Science
5  5               Economics Social Science
6  6 Mechanichal Engineering    Engineering

With recode you don't have to sort the data before recoding it.

FilipW
  • 1,412
  • 1
  • 13
  • 25