
I want to calculate the median age for each group in this data frame:

dfx<-data.frame(group=c(1:100),
               `1`=rnorm(100,50,0.5),
               `2`=rnorm(100,45,15),
               `3`=rnorm(100,17,5))
colnames(dfx) <- c("group", "1","2","3")

Each age column gives the number of individuals of that age (1 - 3) in each group (the real data frame has over 100 columns and runs from age 0 - 90+).

# A tibble: 100 × 4
   group   `1`   `2`   `3`
   <int> <dbl> <dbl> <dbl>
 1     1  49.7  39.5 15.2 
 2     2  49.0  41.4 20.3 
 3     3  49.5  74.8  8.31
 4     4  50.0  34.4 18.2 
 5     5  49.9  41.5 17.1 
 6     6  49.7  40.4 21.0 
 7     7  49.6  72.6 23.8 
 8     8  50.4  41.9 14.9 
 9     9  50.3  63.8 17.8 
10    10  50.0  34.7 26.2 
# … with 90 more rows

I want to calculate the median age for each group to produce something like this:

dfxx<-data.frame(group=c(1:100),
                med_age=rnorm(100,2,0.0001))

Note that these are meant to be integers, just like the original age data.

# A tibble: 100 × 2
   group med_age
   <int>   <dbl>
 1     1    2.00
 2     2    2.00
 3     3    2.00
 4     4    2.00
 5     5    2.00
 6     6    2.00
 7     7    2.00
 8     8    2.00
 9     9    2.00
10    10    2.00
# … with 90 more rows

I assume I need to make a vector/histogram from the data before calculating the median, but I'm not sure how to do this.

Thanks in advance!

uke
Loz

1 Answer


After some research, I found a way to do this with dplyr.

For background, I recommend reading the "Tidy data" chapter of R for Data Science.

I start by tidying the data so that every observation has its own row and every variable its own column. Because "age" is really a variable and the ages 1-3 are just its values, we rearrange the dataset like this:

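library(tidyverse)
# Reshape every age column (everything except group) into age/frequency pairs
dfx <- pivot_longer(dfx, cols = -group, names_to = "age", values_to = "frequency")
# The former column names are character strings, so convert age back to numbers
dfx$age <- as.numeric(dfx$age)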

Afterwards, it looks a lot tidier:

> dfx
# A tibble: 300 × 3
   group   age frequency
   <int> <dbl>     <dbl>
 1     1     1      50.1
 2     1     2      39.0
 3     1     3      18.0
 4     2     1      49.4
 5     2     2      32.3
 6     2     3      16.5
 7     3     1      50.1
 8     3     2      57.2
 9     3     3      22.0
10     4     1      50.5
# … with 290 more rows

The next step is to compute the median age of each group. For example, to get the median of group 1 we need all observations (= rows) that have the value 1 in the group column. group_by does this split for every group at once.

To calculate the median, as you suggested, we need to reconstruct the original data from the frequency table: each age is repeated in a vector as many times as its frequency states. We use rep for this. It takes two arguments: the vector to repeat, and how many times to repeat each of its elements.

You can look this up in the data transformation chapter of R for Data Science. The rep solution is from this post.
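To make the rep step concrete, here is a minimal base R sketch for a single group. The counts are made up for illustration and are not taken from the question's data.

```r
# Toy frequency table for one hypothetical group (made-up counts)
age       <- c(1, 2, 3)
frequency <- c(2, 1, 4)

# rep() repeats each age as many times as its count states
expanded <- rep(age, frequency)
expanded          # 1 1 2 3 3 3 3

# The median of the expanded vector is the median age of the group
median(expanded)  # 3
```

With seven individuals in total, the median is the fourth value of the sorted vector, which is 3.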

The code looks as simple as this:

med <- dfx %>%
  group_by(group) %>%
  # expand the frequency table back into one age per individual, then take the median
  summarise(median_age = median(rep(age, frequency)))

And the results look like this:

> med
# A tibble: 100 × 2
   group median_age
   <int>      <dbl>
 1     1          2
 2     2          2
 3     3          2
 4     4          2
 5     5          1
 6     6          2
 7     7          2
 8     8          2
 9     9          1
10    10          1
# … with 90 more rows

P.S. I did not check how this behaves with non-integer frequency counts, because I think your original data will most likely contain only integer counts.
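On the non-integer case mentioned in the P.S.: rep truncates non-integer values of its times argument toward zero, so fractional counts such as the rnorm values in the example data are silently cut down. The sketch below uses made-up counts to show the effect, with rounding first as one possible workaround. Whether rounding is appropriate for your data is an assumption you would need to check.

```r
# Made-up non-integer counts for one hypothetical group
age       <- c(1, 2, 3)
frequency <- c(1.9, 0.9, 2.2)

# rep() truncates the counts toward zero, so they become 1, 0, 2
truncated <- rep(age, frequency)
truncated          # 1 3 3
median(truncated)  # 3

# Rounding first gives counts of 2, 1, 2, which is closer to the data
rounded <- rep(age, round(frequency))
rounded            # 1 1 2 3 3
median(rounded)    # 2
```

In this toy case the two approaches give different medians, so it is worth deciding explicitly how fractional counts should be handled.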

uke