2

Problem

I have changed the presentation of my question as there seems to be a lack of clarity.

So, we have thousands of hospitals. Their patients are between the ages of 0 and 100 years old. For each age, they have certain numbers of patients e.g. Hospital1 has 10 patients who are 1 year old, 12 patients who are two-year olds, and 0 patients that are a 100 years old etc.

enter image description here

The above dataset is a small and simplified example, my actual dataset contains data for thousands of hospitals and millions of patients.

Outcome sought

I want to know the median patient age of each hospital.

Solution so far

Expand the table so that there is a separate row for each patient's age and then take the median. This will result in my table having hundreds of millions of rows, so is undesirable.

library(dplyr)

## table
hospital <- c(rep(1:3, each = 10))
patient_age <- c(rep(seq(0, 90, by = 10), 3))
number_patients <- round(runif(30, 0, 100),0)
df <- bind_cols(hospital, patient_age, number_patients)
colnames(df) <- c("hospital", "patient_age", "number_patients")

## my impractical solution
df1 <- filter(df, hospital == 1)
df1a <- rep(df1$patient_age, df1$number_patients)
median(df1a)

## there's no way I can repeat this for each hospital (there are 1000s) 
Namra
  • 359
  • 1
  • 10
  • 1
    `aggregate(cbind(age, number)~ rank, df, median)` gives median value for both `age` and `number` for each `rank`. What are you looking for? – Ronak Shah May 17 '21 at 11:48
  • "number" represents the number of people who are that age, so in my example above, the first row shows that there are 59 one-year olds, the second row shows 44 two-year olds. I want to know the median age of all the people in each rank. – Namra May 17 '21 at 12:44
  • @Namra I just saw this question, and I don't agree that it is a duplicate of the link provided. The link provided just calculates a simple median on a list of non unique numbers (i.e. not summarized such as the value `x` can appear 10 times), this question asks about calculating the median based on list of unique numbers, each with a count (`x` appears once, and the count column is 10 (referencing the `x` above). I also understand that expanding 10 values of `x` (`x` count = 10), is not scalable. This is asking a different question than the link. – steveb Apr 12 '22 at 19:40
  • How about a weighted median? `Hmisc` has a weighted quantile function – camille Apr 12 '22 at 20:37
  • @camille thanks for suggesting that! One has to know what to search for with this question, and it is easy to not use the right keyword (i.e. weighted quantile). If you want to add this as an answer (since you suggested it) the feel free to. I can add it if not. The SO link addressing this is [Is there a weighted.median function?](https://stackoverflow.com/questions/2748725/is-there-a-weighted-median-function). – steveb Apr 12 '22 at 21:24
  • I did find that this question could be considered a duplicate of [Is there a weighted.median() function?](https://stackoverflow.com/questions/2748725/is-there-a-weighted-median-function). I am suggesting now that the "duplicate" link change to this. This question is still helpful as this is the one that came up in the google search, not the one in the link (i.e. it depends on the keyword search). – steveb Apr 12 '22 at 22:14

1 Answers1

2

EDIT:

Here's how you calculate the mean patient age by hospital:

df %>%
  group_by(hospital) %>%
  summarise(
    mean_age = sum(patient_age*number_patients)/sum(number_patients)
    )

or simply:

df %>%
  group_by(hospital) %>%
  summarise(
    mean_age = mean(rep(patient_age,number_patients))
  )

Here's the medians:

df %>%
  group_by(hospital) %>%
  summarise(
    median_age = sort(rep(patient_age,number_patients))[length(rep(patient_age,number_patients))/2]
  )

Here, we subset sort(rep(patient_age,number_patients)) on its middle value, which is length(rep(patient_age,number_patients))/2

EDIT 2:

or simply:

df %>%
  group_by(hospital) %>%
  summarise(
    median_age = median(rep(patient_age,number_patients))
  )
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • I noticed you unaccepted the answer. Is anything missing from it? – Chris Ruehlemann May 17 '21 at 12:58
  • Hey Chris, I am not simply looking for the median age based on the second column. Column 3, "total", represents the number of people who are that age e.g. in my example table, there are 59 one-year olds who are in rank 1, 44 two-year olds in rank 1, 12 one-year olds in rank 2 etc. Your answer only shows me the median age without reference to the third column. Hope that makes sense. – Namra May 17 '21 at 14:25
  • So you want to group by `age` and `total`? – Chris Ruehlemann May 17 '21 at 14:27
  • See updated answer. – Chris Ruehlemann May 17 '21 at 14:31
  • So I used my own example and the intended output. When I used your solution, it did not give me the median output. I've clearly miscommunicated the question. I just want to calculate the median where the data is set out in a frequency table e.g. https://www.bbc.co.uk/bitesize/guides/z264jxs/revision/6. – Namra May 18 '21 at 09:40
  • Sorry, this is still unclear. The median of what? – Chris Ruehlemann May 18 '21 at 09:52
  • Hey, I've now changed the example to make the point clearer. Please take a look. – Namra May 18 '21 at 10:54
  • PLease see edited solution. I'm still working to find the median... – Chris Ruehlemann May 18 '21 at 13:55
  • So we have the means and the medians. Is that what you wanted? – Chris Ruehlemann May 18 '21 at 14:21