I have a data frame like this:

ID <- c("A", "B", "C", "D")
birthday <- c(12, 23, 2, 20)
birthmonth <- c(8, 10, 3, 9)
birthyear <- c(79, 62, 66, 83)
mydf <- data.frame(ID, birthday, birthmonth, birthyear)
mydf
  ID birthday birthmonth birthyear
1  A       12          8        79
2  B       23         10        62
3  C        2          3        66
4  D       20          9        83

As you can see, birth years are given as two digits, and the day, month, and year are stored in separate columns. Given such a data frame, how can I calculate the mean age of my sample?

Thank you so much!

dplyr

1 Answer

We could use lubridate's make_date() to combine the individual columns into a date column and then calculate the age. I have shown here how you could handle the missing century (19/20) in birthyear, but you might need to tweak the cutoff for your data.

library(dplyr)
library(lubridate)

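mydf |>
    # two-digit years above 21 are read as 19xx, the rest as 20xx
    mutate(date = make_date(if_else(birthyear > 21, birthyear + 1900, birthyear + 2000), birthmonth, birthday),
           # completed years between the birth date and today
           age  = as.period(interval(date, today()))$year
    )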

Output (the ages depend on today(), so they will differ when run at a later date):

  ID birthday birthmonth birthyear       date age
1  A       12          8        79 1979-08-12  43
2  B       23         10        62 1962-10-23  59
3  C        2          3        66 1966-03-02  56
4  D       20          9        83 1983-09-20  38

And to get the mean age with summarise:

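mydf |>
    # two-digit years above 21 are read as 19xx, the rest as 20xx
    mutate(date = make_date(if_else(birthyear > 21, birthyear + 1900, birthyear + 2000), birthmonth, birthday),
           # completed years between the birth date and today
           age  = as.period(interval(date, today()))$year
    ) |>
    summarise(mean_age = mean(age))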

Output:

  mean_age
1       49

Update: Getting the age calculation right (and fast) can be non-trivial; see e.g. the question "Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date".
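As a rough illustration of what "completed years" means, here is a minimal base R sketch that compares calendar fields directly and takes a fixed reference date, so the result does not change from day to day. The helper name age_in_years and the reference date are my own choices for illustration, not something from lubridate or the linked question. People born on 29 February are counted as one year older from 1 March in non-leap years, which may or may not match the convention you need.

```r
# Hypothetical helper: completed years between birth dates and a reference date.
age_in_years <- function(birth, ref = Sys.Date()) {
  b <- as.POSIXlt(birth)
  r <- as.POSIXlt(ref)
  # difference in calendar years
  age <- r$year - b$year
  # the birthday has not yet occurred in the reference year
  not_yet <- (r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday)
  age - as.integer(not_yet)
}

births <- as.Date(c("1979-08-12", "1962-10-23", "1966-03-02", "1983-09-20"))
ref    <- as.Date("2022-08-28")  # assumed reference date, chosen for illustration

ages <- age_in_years(births, ref)
ages        # 43 59 56 38
mean(ages)  # 49
```

Passing an explicit reference date instead of relying on today() also makes the mean reproducible, which can matter if the analysis is rerun later.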

harre
  • This is what happened: my error changed to this: in `mutate()`: ! Problem while computing `date = make_date(...)`. Caused by error in `if_else()`: ! `false` must be a double vector, not an integer vector. I then changed if_else to ifelse and wrote the code like this: `mydf <- mydf |> mutate(date = make_date(ifelse(birthyear > 21, birthyear+1900, birthyear), birthmonth, birthday), age = as.period(interval(date, today()))$year )` and then it worked. I don't know the reason behind it, but I wanted to share it anyway. Thank you! – dplyr Aug 28 '22 at 15:54
  • It has to do with `birthyear` being stored as an integer rather than a double. You might want to add `mutate(birthyear = as.numeric(birthyear))` to be sure not to get unexpected behaviour. – harre Aug 28 '22 at 16:01
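The type mismatch from the comments can be reproduced in a small sketch. The integer values below are made up for illustration. Adding the double constant 1900 to an integer vector yields a double, while the bare integer vector stays an integer, and the if_else() in the dplyr version used at the time refused to mix the two (as far as I know, newer dplyr releases are more lenient here). Making both branches doubles, or converting the column first, sidesteps the problem.

```r
library(dplyr)

# Assumed example: birthyear stored as integer, e.g. after reading from a file
birthyear <- c(79L, 62L, 5L)

typeof(birthyear + 1900)  # "double": the double constant promotes the result
typeof(birthyear)         # "integer"

# Both branches are doubles here, so if_else() sees matching types
year_full <- if_else(birthyear > 21, birthyear + 1900, birthyear + 2000)
year_full  # 1979 1962 2005

# Alternative: convert the column up front, as suggested in the comment
birthyear_num <- as.numeric(birthyear)
typeof(birthyear_num)  # "double"
```

Base ifelse() coerces silently, which is why switching to it made the error disappear.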