0

I have 8 age categories, with each of them having its own column (i.e. residents_under_5, residents_6_to_12, etc. Each column has a value between 0 and 3, for the number of people in that household in that specific age category. What I want is a new column with which I can plot the total distribution of age of my population on a histogram. So I was thinking of a column that has 66 rows of residents_under_5, 32 rows of residents_6_to_12, etc., for the sum of those categories.

My data looks like this:

a b c d 
0 3 2 1
1 3 2 1
2 0 2 1
3 1 0 0

What I want is a column e that shows:

e
a
a
a
a
b
b
b
b
b
c
c
c
d
d
d

For the total number of occurences in the other columns.

I've tried declaring new columns with sum(residents_under_5), but that will give me 1 row with 66 (as the sum of that category). I can't plot a histogram with such a column. I hope someone can figure it out!

This is the dput() of the relevant columns

residents_under_5 = c(0, 0, 0, 1, 1, 2), 
residents_6_to_12 = c(0, 0, 0, 0, 0, 0), 
        residents_13_to_18 = c(0, 0, 0, 0, 0, 0), 
residents_19_to_24 = c(0, 
        0, 0, 0, 0, 0), 
residents_25_to_34 = c(0, 1, 2, 0, 1, 0), 
       residents_35_to_49 = c(0, 0, 0, 2, 1, 2), 
residents_50_to_64 = c(0, 
        1, 0, 0, 0, 0), 
residents_65_and_older = c(2, 0, 0, 0, 1, 
        0)
Ben Gill
  • 25
  • 6

2 Answers2

1

You can unlist the dataframe and calculate frequencies using table and then repeat letters using rep.

rep(letters[seq_len(ncol(df))], colSums(df))

data

df <- data.frame(residents_under_5 = c(0, 0, 0, 1, 1, 2), 
                 residents_6_to_12 = c(0, 0, 0, 0, 0, 0), 
                 residents_13_to_18 = c(0, 0, 0, 0, 0, 0), 
                 residents_19_to_24 = c(0, 0, 0, 0, 0, 0), 
                 residents_25_to_34 = c(0, 1, 2, 0, 1, 0), 
                 residents_35_to_49 = c(0, 0, 0, 2, 1, 2), 
                 residents_50_to_64 = c(0, 1, 0, 0, 0, 0), 
                 residents_65_and_older = c(2, 0, 0, 0, 1, 0))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Does this also work for data table? I have more columns than just these 4, how do I specify only the relevant columns? Do I just write 'names', or what should go instead? thanks! – Ben Gill May 27 '19 at 13:25
  • @BenGill This is taking into consideration all the columns whether 4 or 100.You might need to subset the columns If you want to ignore them. For example, to ignore 1st column we might do `rep(names(df)[-1], table(unlist(df)[-1]))` – Ronak Shah May 27 '19 at 13:28
  • Apparently it doesn't work as I have negative values in my data set (error: invalid 'times' argument) – Ben Gill May 27 '19 at 13:37
  • @BenGill I create a new dataframe using the `dput` of your relevant columns and updated the answer, Can you check now? – Ronak Shah May 27 '19 at 13:44
  • it gives me 5000 rows of "a" – Ben Gill May 27 '19 at 13:50
  • @BenGill I tried re-reading your question and updated the answer. Make sure that `df` has only relevant columns. See if that works for you or else I'll delete the post since I am not able to help you. – Ronak Shah May 27 '19 at 13:57
  • Thank you so much! It worked now. I first created a new data table with only the relevant columns. Your code gave me a vector with the right values. I declared it into a new variable as data frame, and will now plot it on a histogram! – Ben Gill May 27 '19 at 15:42
1

An option in tidyverse would be to get the sum of all columns with summarise_all, gather into 'long' format and uncount the 'value' column

library(tidyverse)
df1 %>% 
   summarise_all(sum) %>%
   gather %>% 
   uncount(value)

data

df1 <- structure(list(a = 0:3, b = c(3L, 3L, 0L, 1L), c = c(2L, 2L, 
2L, 0L), d = c(1L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, 
  -4L))
akrun
  • 874,273
  • 37
  • 540
  • 662