0

I have a problem where my data.frame consist of different attributes due to different data source. For example, the state column is actually of the same states but in different representation. Note that my actual data is not using US states.

    df <- data.frame(Names=c("Adam", "Mark", "Dahlia", "Jeff", "Derek", 
                             "Arnold", "Sheppard", "Dwayne", "Nichols", "Shane"), 
                     Age=c(27, 28, 29, 37, 26, 22, 29, 34, 31, 30), 
                     States=c("AL", "Alaska", "Alabama", "WI", 
                              "Wisconsin", "AZ", "Arizona", "AL", "WI", "AK"))

I am trying to recode values like AL, WI, AZ, and AK as Alabama, Wisconsin, Arizona, and Alaska respectively.

So far I came across:

    case_when(

        df$States == "AL" ~ "Alabama",
        df$States == "AK" ~ "Alaska",
        df$States == "WI" ~ "Wisconsin",
        df$States == "AZ" ~ "Arizona",
    )

and it gives me output:

     [1] "Alabama"   NA          NA          "Wisconsin" NA    "Arizona" NA         
     [8] "Alabama"   "Wisconsin" "Alaska"

I don't want the NA value so what I did is:

    case_when(

      df$States == "AL" ~ "Alabama",
      df$States == "Alabama" ~ "Alabama",
      df$States == "AK" ~ "Alaska",
      df$States == "Alaska" ~ "Alaska",
      df$States == "WI" ~ "Wisconsin",
      df$States == "Wisconsin" ~ "Wisconsin",
      df$States == "AZ" ~ "Arizona",
      df$States == "Arizona" ~ "Arizona",

    )

It gives me the output I want but I think there is much more simpler way to do this.

I'm thinking about loop because later I would like to turn it into pseudo-code. However, I'm running out of ideas on how to execute this. Really appreciate everyone helps out here.

Thank you.

mojek
  • 195
  • 1
  • 9

2 Answers2

4

You can use dplyr's recode function together with a named vector. I use setNames to create a named character vector (something like a key-value pair), but you can make your vector out of whatever data you have. Using your example, we can set up some keys and values:

keys <- state.abb # the abbreviations you want to replace
vals <- state.name # the replacement values
keysvals <- setNames(vals, keys) # create named vector

Now call recode. Make sure you use the !!! to unquote and splice:

library(dplyr)

df$States <- recode(df$States, !!!keysvals)

Which will return:

      Names Age    States
1      Adam  27   Alabama
2      Mark  28    Alaska
3    Dahlia  29   Alabama
4      Jeff  37 Wisconsin
5     Derek  26 Wisconsin
6    Arnold  22   Arizona
7  Sheppard  29   Arizona
8    Dwayne  34   Alabama
9   Nichols  31 Wisconsin
10    Shane  30    Alaska
  • Thanks @gersht. This really helps. My other question, is the `keys` and `vals` have to be in the same length? Because, my data is that I would like to change let say for example `MY` or `Malaysian` into `Malaysia` and `SG` and `Singaporean` into `Singapore`. My concern is that, sometimes there are multiple attributes in `keys` that using different abbreviation but I want to change it into one same names. – mojek Jul 12 '19 at 08:56
  • 2
    @mojek they do have to be the same length, but given your example that shouldn't be a problem. All you have to do is add multiple key-value pairs for each possible key, where the value is simply duplicated. For example you can do `keysvals <- c(MY=Malaysia, Malaysian=Malaysia, SG = Singapore, Singaporean = Singapore)`. –  Jul 12 '19 at 09:27
0

If you are planning to match with US state names we can use the inbuilt vector state.abb and state.name to match and replace.

inds <- match(df$States, state.abb)
df$States[which(!is.na(inds))] <- state.name[na.omit(inds)]

df
#       Names Age   States
#1      Adam  27   Alabama
#2      Mark  28    Alaska
#3    Dahlia  29   Alabama
#4      Jeff  37 Wisconsin
#5     Derek  26 Wisconsin
#6    Arnold  22   Arizona
#7  Sheppard  29   Arizona
#8    Dwayne  34   Alabama
#9   Nichols  31 Wisconsin
#10    Shane  30    Alaska

Also you can reduce the length of case_when by using %in% which can compare multiple vectors instead of only one using ==

library(dplyr)

df %>%
  mutate(States = case_when(States %in% c("AL", "Alabama") ~ "Alabama", 
                            States %in% c("AK", "Alaska")~ "Alaska", 
                            States %in% c("WI", "Wisconsin")~ "Wisconsin", 
                            States %in% c("AZ", "Arizona")~ "Arizona", 
                             TRUE ~ NA_character_))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you. But my data actually not using US states. – mojek Jul 12 '19 at 06:30
  • 1
    @mojek ok..in that case you can then use the `case_when` answer. Also this is quite similar to https://stackoverflow.com/questions/7547597/dictionary-style-replace-multiple-items see if you could use any of the answer from there. – Ronak Shah Jul 12 '19 at 06:34