I have language data containing over 200 languages with some missing values coded as '' (0 length characters).
I would like to compress this using factor to code the main languages, with all others coded as 'other language' and '' coded as '(missing)', which should show up last.
My plan is this:
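lanfmt <- list(
  # levels: the main languages, then every other observed language, then the empty string
  lev = c(prime <- c('English', 'Russian', 'Urdu'), diff <- setdiff(levels(lan), c(prime, '')), ''),
  # labels: many-to-one, every non-main language maps to 'other language'
  lab = c(prime, rep('other language', length(diff)), '(missing)')
)
table(factor(lan, lanfmt$lev, lanfmt$lab))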
but R doesn't like many-to-one formats of factors. How do I aggregate into a single category?
EDIT:
I decided that a good solution, using lanfmt as described above, is the following:
table(lanfmt$lab[match(lan, lanfmt$lev)])
It's not as elegant, but it works in a pinch.
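For anyone who wants to try the lookup approach, here is a minimal sketch on made-up data. The toy `lan` vector and the names `rest`, `lev`, `lab`, `collapsed`, and `counts` are only for illustration and are not part of my real data. Wrapping the looked-up labels in factor with an explicit level order is an extra step I am assuming is wanted, so that '(missing)' is tabulated last instead of being sorted alphabetically.

```r
# Toy data standing in for the real `lan` factor (hypothetical values).
lan <- factor(c('English', 'Russian', 'Urdu', 'French', 'Hindi', '', 'English', ''))

prime <- c('English', 'Russian', 'Urdu')          # languages kept as-is
rest  <- setdiff(levels(lan), c(prime, ''))       # every other observed language

# Parallel vectors: lev[i] is recoded to lab[i] (many-to-one for `rest`).
lev <- c(prime, rest, '')
lab <- c(prime, rep('other language', length(rest)), '(missing)')

# Look up each value's label, then fix the level order explicitly
# so that '(missing)' comes last in the table.
collapsed <- factor(lab[match(lan, lev)],
                    levels = c(prime, 'other language', '(missing)'))
counts <- table(collapsed)
print(counts)
```

As far as I can tell, newer versions of R (3.4.0 and later) also accept repeated labels in factor, so the direct factor(lan, lev, lab) call may work there without the lookup step.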