2

I have a dataframe column with numerous textual values (levels). I need to map those values to a predefined object-like structure in order to reduce the number of levels. I could easily achieve this in Python using a dictionary but could not do the same with a list in R.

For example, my dataframe column is something like:

df <- data.frame(weather = c('Clear','Snow','Clear','Rain','Rain','Other','Hail/sleet','Unknown'))

I need to map this to a list like

weather.levels <- list(
  dry = c('Clear', 'Cloudy'),
  wet = c('Snow', 'Rain', 'Hail/sleet'),
  other = c('Other','Unknown'))

so that my transformed dataframe looks like

    old.weather new.weather
1       Clear         dry
2        Snow         wet
3       Clear         dry
4        Rain         wet
5        Rain         wet
6      Other1       other
7  Hail/sleet         wet
8     Unknown       other

I have looked at solutions like this and this, but these do not answer my question. I cannot create a dataframe to use R's match function because the number of levels in each category of the preset dictionary weather.levels ('dry', 'wet', 'other') are different.

emphasent
  • 73
  • 2
  • 9

6 Answers6

4

As there often is, there is a base R function designed to do exactly this. levels<- is what you want:

Note that the df$weather variable needs to be a factor variable for this to work appropriately (the below code without explicitly changing to a factor first worked pre R 4.0 because df$weather was a factor by default in the data.frame call).

df$new.weather <- `levels<-`(df$weather, weather.levels)
## if variable not already a factor, instead:
df$new.weather <- `levels<-`(factor(df$weather), weather.levels)
df
#     weather new.weather
#1      Clear         dry
#2       Snow         wet
#3      Clear         dry
#4       Rain         wet
#5       Rain         wet
#6      Other       other
#7 Hail/sleet         wet
#8    Unknown       other

In a slightly longer but simpler to read form this is equivalent to:

df$new.weather <- df$weather
levels(df$new.weather) <- weather.levels
thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • 1
    Great solution! Had no idea you could use a named list as a value for levels. – Shree Dec 05 '18 at 01:13
  • 1
    I was (cautiously) dubious about "designed to do exactly this" until I read the man page: *"or a named list specifying how to rename the levels."*. Another day, another something learned. Be honest, thelatemail: how often do you use this aspect of it? – r2evans Dec 05 '18 at 05:09
  • 1
    @r2evans - honestly, only occasionally. In my line of work much larger reference tables already exist that I just do a merge against. But this is a quick method that I throw in now and then in interactive use. It is a cracker of a function though. – thelatemail Dec 05 '18 at 05:13
  • @thelatemail This is so neat and easy! I'm starting to love R more now :-) – emphasent Dec 08 '18 at 21:59
  • @emphasent - glad to hear it! It's an incredible tool once you get rolling with it :-) – thelatemail Dec 08 '18 at 23:01
2

Please note that both of the solutions in the accepted answer don't work for R-4.0.2 anymore. The new.weather returned the same values as the weather column.

df <- data.frame(weather = c('Clear','Snow','Clear','Rain','Rain','Other','Hail/sleet','Unknown'))
weather.levels <- list(
  dry = c('Clear', 'Cloudy'),
  wet = c('Snow', 'Rain', 'Hail/sleet'),
  other = c('Other','Unknown'))
df$new.weather <- `levels<-`(df$weather, weather.levels)

Result in R-4.0.2

The same snippet works for R-3.5.2.

Result in R-3.5.2

Elisa Yao
  • 21
  • 2
  • 1
    The answer still works, the difference is that the `data.frame(...)` call when the original question was written resulted in `df$weather` being a `factor`, while now it's a `character` due to a change in the `stringsAsFactors=` default argument - https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html . Hence you need to do `df$new.weather <- \`levels<-\`(factor(df$weather), weather.levels)` now. – thelatemail Jun 08 '22 at 20:56
1

Here's one way using dplyr -

weather.levels %>% 
  unlist() %>% 
  data_frame(new.weather = gsub("[0-9]", "", names(.)), old.weather = .) %>% 
  left_join(df, ., by = c("weather" = "old.weather"))

     weather new.weather
1      Clear         dry
2       Snow         wet
3      Clear         dry
4       Rain         wet
5       Rain         wet
6      Other       other
7 Hail/sleet         wet
8    Unknown       other
Shree
  • 10,835
  • 1
  • 14
  • 36
  • Thanks @Shree, this is much easier than I had imagined, though not the cleanest method. This replaces all instances of integers in the level names, which may cause issues with names with integers, although there may be some regexp workaround for that too. – emphasent Dec 08 '18 at 21:43
1

There are three easy methods. Up front, I'm going to modify the data slightly (remove "Other") to highlight one strength of one of the methods.

df <- data.frame(weather = c('Clear','Snow','Clear','Rain','Rain','Other','Hail/sleet','Unknown'))
weather.levels <- list(
  dry = c('Clear', 'Cloudy'),
  wet = c('Snow', 'Rain', 'Hail/sleet'),
  other = c('Unknown'))

Simple Lookup

levels1 <- c(Unknown="other",Snow="wet",Rain="wet","Hail/sleet"="wet",Clear="dry",Cloudy="dry")
### levels1 <- setNames(rep(names(weather.levels), lengths(weather.levels)), unlist(weather.levels))
transform(df, newwx = levels1[as.character(weather)])
#      weather newwx
# 1      Clear   dry
# 2       Snow   wet
# 3      Clear   dry
# 4       Rain   wet
# 5       Rain   wet
# 6      Other  <NA>
# 7 Hail/sleet   wet
# 8    Unknown other

(I'm using transform which is base-R, but you can easily use dplyr and such if you're more comfortable.)

Table Merge

This is essentially what Shree's answer does (though the concept is not just dplyr and friends).

df2 <- data.frame(wxfrom = names(levels1), wxto = levels1, stringsAsFactors=FALSE, row.names=NULL)
merge(df, df2, by.x="weather", by.y="wxfrom", all.x=TRUE)
#      weather  wxto
# 1      Clear   dry
# 2      Clear   dry
# 3 Hail/sleet   wet
# 4      Other  <NA>
# 5       Rain   wet
# 6       Rain   wet
# 7       Snow   wet
# 8    Unknown other

Similar to:

dplyr::left_join(df, df2, by=c("weather"="wxfrom"))

Lookup With Default

transform(df, newwx = levels1[ match(as.character(weather), names(levels1), nomatch=1L) ])
#      weather newwx
# 1      Clear   dry
# 2       Snow   wet
# 3      Clear   dry
# 4       Rain   wet
# 5       Rain   wet
# 6      Other other
# 7 Hail/sleet   wet
# 8    Unknown other

This last one has the innate ability to assign an unknown to any non-matches. With the others, it is as simple as doing ifelse(is.na(newwx), "unk", newwx), so it doesn't add a whole lot.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • thanks @r2evans. Is there any way the `levels1` list could be obtained from the `weather.levels` list because the latter seems prettier and easier to write? – emphasent Dec 08 '18 at 21:50
0

Code :

sapply(df$weather, function(w) names(weather.levels[sapply(lapply(weather.levels, function(y) lapply(y, function(x) w %in% x)), function(z) any(z))]))

Then cbind it to df and change col names.

Özgür Can Karagöz
  • 1,039
  • 1
  • 13
  • 32
Brian Syzdek
  • 873
  • 6
  • 10
0

Using dplyr::case_when:

df %>% 
  dplyr::mutate(new.weather = dplyr::case_when(
    weather %in% weather.levels$dry   ~ "dry",
    weather %in% weather.levels$wet   ~ "wet",
    weather %in% weather.levels$other ~ "other",
    TRUE                              ~ "unrecognized"
  ))
Josh
  • 1,210
  • 12
  • 30