I have language data covering over 200 languages, with missing values coded as '' (zero-length strings).

I would like to collapse this with factor so that the main languages keep their own level, all others are coded as 'other language', and '' is coded as '(missing)' and appears last.

My plan is this:

prime <- c('English', 'Russian', 'Urdu')
diff <- setdiff(levels(lan), c(prime, ''))
lanfmt <- list(
  lev = c(prime, diff, ''),
  lab = c(prime, rep('other language', length(diff)), '(missing)')
)

table(factor(lan, lanfmt$lev, lanfmt$lab))

but R doesn't like many-to-one mappings from levels to labels in factor. How do I aggregate several levels into a single category?

EDIT:

I decided that a good solution, using lanfmt as described above, is the following:

table(lanfmt$lab[match(lan, lanfmt$lev)])

It's not as elegant, but it works in a pinch.
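For reference, here is a self-contained sketch of that lookup. It assumes the labels vector repeats 'other language' once for every non-main level; the sample vector and the names other and recoded are made up for illustration. Wrapping the result in factor with explicit levels is one way to put '(missing)' last in the table.

```r
# Hypothetical sample data; the real vector has 200+ languages
lan <- factor(c("English", "Russian", "Urdu", "", "Indonesian", "Thai", "English"))

prime <- c("English", "Russian", "Urdu")
other <- setdiff(levels(lan), c(prime, ""))

# One label per level: many levels share the label "other language"
lanfmt <- list(
  lev = c(prime, other, ""),
  lab = c(prime, rep("other language", length(other)), "(missing)")
)

# Look up each value's label, then fix the display order of the categories
recoded <- factor(
  lanfmt$lab[match(lan, lanfmt$lev)],
  levels = c(prime, "other language", "(missing)")
)
table(recoded)
```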

AdamO
  • Also have a look at [this question](http://stackoverflow.com/q/10431403). Although the OP has to do with recoding integer values to character, many (all?) of the approaches will work for your case as well. – BenBarnes Nov 14 '12 at 20:49

1 Answer

I think you should convert your factors to character, edit them and then order them. Maybe something like this would help (lan being the language vector of your list / data frame):

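lan <- c("English", "Russian", "Urdu", "", "Indonesian")
lan <- factor(lan)
# Categories to keep; "missing" is listed so it survives the "other language" step
prime <- c("English", "Russian", "Urdu", "missing")
missing <- ""

# Work on character values: assigning a value that is not a level to a factor gives NA
lan <- as.character(lan)
lan[lan %in% missing] <- "missing"

# Everything not in prime collapses into a single category
lan[!lan %in% prime] <- "other language"
lan <- factor(lan)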
lan
[1] English        Russian        Urdu           missing       
[5] other language
Levels: English missing other language Russian Urdu

After that you can put the levels in the order you want:

lev_order <- c("English", "Russian", "Urdu", "other language", "missing")
lan <- ordered(lan, lev_order)
dt <- data.frame(lan, stuff=rnorm(5,4,1))
dt[with(dt, order(lan)),]

             lan    stuff
1        English 4.212460
2        Russian 3.681616
3           Urdu 3.409838
5 other language 3.304108
4        missing 3.938468
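
As an alternative sketch (not what the code above does), base R's levels<- for factors accepts a named list in which each name is a new level and each element holds the old levels to merge into it. The new levels come out in list order, so the collapsing and the ordering happen in one step. The sample data and the names other and mapping are assumptions for illustration.

```r
# Hypothetical sample data
lan <- factor(c("English", "Russian", "Urdu", "", "Indonesian", "Thai"))

prime <- c("English", "Russian", "Urdu")
other <- setdiff(levels(lan), c(prime, ""))

# Named list: new level name -> old levels that map to it
mapping <- c(
  setNames(as.list(prime), prime),
  list("other language" = other, "(missing)" = "")
)

levels(lan) <- mapping
table(lan)
```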
Mikko