programatically create new variables which are sums of nested series of other variables

Question

I have data giving me the percentage of people in some groups who have various levels of educational attainment:

df <- data_frame(group = c("A", "B"),
             no.highschool = c(20, 10),
             high.school = c(70,40),
             college = c(10, 40),
             graduate = c(0,10))

df
    # A tibble: 2 x 5
  group no.highschool high.school college graduate
  <chr>         <dbl>       <dbl>   <dbl>    <dbl>
1 A               20.         70.     10.       0.
2 B               10.         40.     40.      10.

E.g., in group A 70% of people have a high school education.

I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).

desired df would be:

desired.df <- data.frame(group = c("A", "B"),
                     no.highschool = c(20, 10),
                     high.school = c(70,40),
                     college = c(10, 40),
                     graduate = c(0,10),
                     lessthan_no.highschool = c(0,0),
                     lessthan_high.school = c(20, 10),
                     lessthan_college = c(90, 50),
                     lessthan_graduate = c(100, 90))

In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programatically (and elegantly) using tidyverse tools?

I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...

there isn't a level below no.highschool, therefore lessthan_no.highschool will always be 0. — lost, Aug 26 '18 at 03:52
In `desired.df` you have the variable `less.than.hs`. Shouldn't it be `no.highschool`? — Rui Barradas, Aug 26 '18 at 04:27
@lost Gregor beat me to it, in your desired result you repeat the variables of your input so their names should be the same. One of them is not. I assumed this to be a typo. Oh, and I had missed the part about `tidyverse` so I was busy coding a base R way. Would that be of interest? — Rui Barradas, Aug 26 '18 at 04:37
@RuiBarradas, a base R method isn't of interest to me right now, but if you've already started on it perhaps it might be to someone else who finds this later :) — lost, Aug 26 '18 at 04:46

score 2 · Answer 1 · answered Aug 26 '18 at 05:09

Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.

tmp <- apply(df[-1], 1, function(x){
    s <- cumsum(x)
    100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))

desired.df
#  group no.highschool high.school college graduate lessthan_no.highschool
#1     A            20          70      10        0                      0
#2     B            10          40      40       10                      0
#  lessthan_high.school lessthan_college lessthan_graduate
#1                   20               90               100
#2                   10               50                90

score 1 · Accepted Answer · answered Aug 26 '18 at 04:29

how could I do this programatically (and elegantly) using tidyverse tools?

Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.

library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
  mutate(education = factor(education, levels = names(df)[-1])) %>%
  group_by(group) %>%
  mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups:   group [2]
#   group education         n lessthan_x
#   <chr> <fct>         <dbl>      <dbl>
# 1 A     no.highschool    20          0
# 2 A     high.school      70         20
# 3 A     college          10         90
# 4 A     graduate          0        100
# 5 B     no.highschool    10          0
# 6 B     high.school      40         10
# 7 B     college          40         50
# 8 B     graduate         10         90

This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").

it's intentionally in a non-tidy format. It's in this format because it's going to be joined to individual-level data which is in tidy format, and which will be used for modeling, etc. — lost, Aug 26 '18 at 04:34
Right, great. But if you want to "elegantly" use `tidyverse` tools, you'll need to tidy it first. And the questions I linked to should help you get it back to the non-tidy format you need. I don't think I need to repeat those answers here. If there are changes or updates, those questions which are focused on that piece of the problem should get updated. — Gregor Thomas, Aug 26 '18 at 04:37
This works, though the order of variables is different from in OP: desired.df <- tidy_result %>% select(-n) %>% mutate(education = paste0("lessthan_", education)) %>% spread(education, lessthan_x) %>% right_join(df) — lost, Aug 26 '18 at 05:02

programatically create new variables which are sums of nested series of other variables

2 Answers2