Categorizing variable with multiple values in one cell in R

Question

I am new to coding in R and to posting here, so please let me know if I can add anything helpful. I am trying to create a new categorical variable "income" (3 levels) that assigns each of a predetermined set of countries (x, y, z) to one of the levels. My issue is that the countries variable has multiple countries in each cell, so I don't know how to categorize it.

ID           countries        **income**
1            x, y, z          LMIC, HMIC, UMIC
2            y                HMIC
3            x, z             LMIC, UMIC
1            z                UMIC

This is the code I have, but it only works on rows that contain a single country, i.e. rows like "x, y, z" remain unchanged. My ultimate goal is to create the income variable and get a total count of how many IDs fall into each income category.

data.set$countries <- revalue(data.set$income, c("x"=1, "y"=2, "z"=3))

data.set$income[dataset$countries == 1] <- "LMIC"
data.set$income[dataset$countries == 2] <- "HMIC"
data.set$income[dataset$countries == 3] <- "UMIC"
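
To illustrate why only the single-country rows are affected, here is a minimal base R sketch using the sample values from the table above. The names `is_y`, `parts`, and `has_y` are introduced here for illustration only. An equality test compares the whole cell to the country, so "x, y, z" never equals "y"; splitting the cell first makes each country testable on its own.

```r
# Sample values taken from the question's table
countries <- c("x, y, z", "y", "x, z", "z")

# == compares the entire string, so multi-country cells never match
is_y <- countries == "y"
print(is_y)

# Splitting on ", " yields one character vector of countries per cell
parts <- strsplit(countries, ", ", fixed = TRUE)

# Membership test per cell: does the cell contain country "y"?
has_y <- vapply(parts, function(p) "y" %in% p, logical(1))
print(has_y)
```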

score 0 · Answer 1 · answered Jul 20 '22 at 02:42

One option is to first split countries into one row per country with separate_rows, then use recode to assign the income category. Finally, tally gives a count for each income category.

library(tidyverse)

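df %>%
  # split comma-separated countries so each row holds one country
  separate_rows(countries, sep = ", ") %>%
  # map each country to its income category
  mutate(income = recode(countries, "x" = "LMIC", "y" = "HMIC", "z" = "UMIC")) %>%
  group_by(income) %>%
  # tally() counts rows per group; tally(ID) would sum the ID values instead
  tally()

Output

  income     n
  <chr>  <int>
1 HMIC       2
2 LMIC       2
3 UMIC       3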

Data

df <- structure(list(ID = c(1L, 2L, 3L, 1L), countries = c("x, y, z", 
"y", "x, z", "z")), class = "data.frame", row.names = c(NA, -4L
))
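
The table in the question shows income as a comma-separated cell next to countries, one row per original row. A minimal base R sketch of that shape follows, assuming the same country-to-income mapping as in the recode call. The names `lookup` and `to_income` are hypothetical helpers introduced here, not part of the original answer.

```r
# Same sample data as above
df <- data.frame(ID = c(1L, 2L, 3L, 1L),
                 countries = c("x, y, z", "y", "x, z", "z"))

# Hypothetical lookup table: country -> income category
lookup <- c(x = "LMIC", y = "HMIC", z = "UMIC")

# Translate one cell's countries and collapse back into a single string;
# unique() drops repeated categories within a cell
to_income <- function(parts) paste(unique(lookup[parts]), collapse = ", ")

df$income <- vapply(strsplit(df$countries, ", ", fixed = TRUE),
                    to_income, character(1))
print(df)
```

This keeps the original row count, which may be convenient when the income column is needed alongside other per-row variables.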

answered Jul 20 '22 at 02:42

AndrewGB

Hi, I had a follow-up question! I followed your code, which turned n=134,086 observations (publications) into n=388,844 observations; thus, when I try to tally the categories, I get values like HMIC = 305,000. My goal is to tally against the original publication count so I can calculate what proportion of the n=134k were HMIC, LMIC, etc. This would mean that an Income cell (by ID) containing, for example, "HMIC, HMIC, LMIC" would count as only one HMIC and one LMIC. – aml129 Jul 21 '22 at 20:44
@aml129 It's hard to know without having the structure of your data or the expected output. You can use `dput()` to provide your data. You can also use `head()` to provide just a few rows of data, so put `dput(head(data.set))` into the console, then you can paste the result into your question. Then, provide what you want the expected output to be for your sample. Then, I can adjust my answer. You can also see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info. – AndrewGB Jul 22 '22 at 05:31
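
One possible way to approach the follow-up, sketched on the sample data only since the real data was not posted: de-duplicate ID/income pairs before counting, then divide by the number of distinct IDs. This assumes each ID identifies one publication. The names `total_ids`, `n_ids`, and `prop` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the answer
df <- data.frame(ID = c(1L, 2L, 3L, 1L),
                 countries = c("x, y, z", "y", "x, z", "z"))

# Denominator: number of distinct IDs (assumed to be publications)
total_ids <- n_distinct(df$ID)

result <- df %>%
  separate_rows(countries, sep = ", ") %>%
  mutate(income = recode(countries, x = "LMIC", y = "HMIC", z = "UMIC")) %>%
  # keep each ID at most once per income category
  distinct(ID, income) %>%
  count(income, name = "n_ids") %>%
  mutate(prop = n_ids / total_ids)

print(result)
```

Because one ID can fall into several categories, the proportions are not expected to sum to 1.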
