coding multiple levels with single label with factor

Question

I have language data containing over 200 languages, with some missing values coded as '' (zero-length strings).

I would like to collapse this using factor: keep the main languages as their own levels, code all others as 'other language', and code '' as '(missing)' so that it shows up at the end of the levels.

My plan is this:

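# Main languages keep their own label; every other language collapses to one.
prime <- c('English', 'Russian', 'Urdu')
diff <- setdiff(levels(lan), c(prime, ''))
lanfmt <- list(
  lev = c(prime, diff, ''),
  lab = c(prime, rep('other language', length(diff)), '(missing)')
)

table(factor(lan, lanfmt$lev, lanfmt$lab))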

but R doesn't like many-to-one formats of factors. How do I aggregate into a single category?

EDIT:

I decided that a good solution, using lanfmt as described above, is the following:

table(lanfmt$lab[match(lan, lanfmt$lev)])

It's not as elegant, but it works in a pinch.
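
For readers on a newer R: as far as I know, from R 3.4.0 onward factor() accepts duplicated labels and merges them, so the many-to-one plan can be written directly. A minimal sketch under that version assumption, with made-up sample data standing in for the real language vector (the names other and recoded are only illustrative):

```r
# Hypothetical sample data; the real vector has 200+ languages.
lan <- factor(c("English", "Russian", "Urdu", "", "Indonesian", "Tagalog", "English"))

prime <- c("English", "Russian", "Urdu")
other <- setdiff(levels(lan), c(prime, ""))

# Levels must be unique, but labels may repeat (assumes R >= 3.4.0).
# Repeated labels are merged into a single level.
recoded <- factor(
  lan,
  levels = c(prime, other, ""),
  labels = c(prime, rep("other language", length(other)), "(missing)")
)

table(recoded)
```

Because the merged levels keep the order in which the labels first appear, '(missing)' ends up last in the table without a separate reordering step.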

Also have a look at [this question](http://stackoverflow.com/q/10431403). Although the OP has to do with recoding integer values to character, many (all?) of the approaches will work for your case as well. — BenBarnes, Nov 14 '12 at 20:49

Mikko · Accepted Answer · 2012-11-14T20:43:24.423

I think you should convert your factors to character, edit them and then order them. Maybe something like this would help (lan being the language vector of your list / data frame):

lan <- c("English", "Russian", "Urdu", "", "Indonesian")
lan <- factor(lan)
prime <- c("English", "Russian", "Urdu", "missing")
missing <- ""

lan <- as.character(lan)
lan[lan %in% missing] <- "missing"

lan[!lan %in% prime] <- "other language"
lan <- factor(lan)
lan
[1] English        Russian        Urdu           missing       
[5] other language
Levels: English missing other language Russian Urdu

After that you can order your languages:

# Named lev_order rather than order, so it does not shadow base::order() used below.
lev_order <- c("English", "Russian", "Urdu", "other language", "missing")
lan <- ordered(lan, lev_order)
dt <- data.frame(lan, stuff=rnorm(5,4,1))
dt[with(dt, order(lan)),]

             lan    stuff
1        English 4.212460
2        Russian 3.681616
3           Urdu 3.409838
5 other language 3.304108
4        missing 3.938468
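
If you would rather stay with factors and skip the character round trip, another option is to assign a named list to levels(). This is only a sketch on the same toy vector; the name mapping is hypothetical, and any old level left out of the list becomes NA:

```r
lan <- factor(c("English", "Russian", "Urdu", "", "Indonesian"))
prime <- c("English", "Russian", "Urdu")

# Each list name is a new level; each element holds the old levels it absorbs.
mapping <- c(
  setNames(as.list(prime), prime),
  list(
    "other language" = setdiff(levels(lan), c(prime, "")),
    "missing" = ""
  )
)

# Many-to-one recode in a single assignment.
levels(lan) <- mapping
lan
```

The new level order follows the order of the names in the list, so "missing" lands at the end here as well.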

This isn't what I wanted but it did give me the idea. Thanks! — AdamO, Nov 14 '12 at 20:47
