1

I have data which looks like this. It is a dataframe containing the date of birth (among other information) for a number of people.

library(tidyr)
library(dplyr)
library(magrittr)
library(lubridate)

df <- data.frame(
DATE_OF_BIRTH = c("20/10/01" , "15/04/88", "16/12/58", "15/10/91", "09/02/66", "02/07/03", "20/08/96", "22/04/99", "17/04/87", "17/08/56",
                "28/05/40", "26/07/59", "02/04/65", "17/08/93", "01/08/86", "30/07/01", "03/09/75", "17/09/65", "16/02/95", "11/06/03",
                "26/10/64", "25/02/73", "07/02/90", "31/03/38", "05/03/83", "10/02/61", "01/07/40", "15/08/51", "19/12/75", "25/11/58",
                "05/11/81", "05/12/02", "06/05/40", "23/09/69", "17/04/48", "02/07/58", "04/03/98", "26/11/03", "08/01/91", "23/12/07",
                "05/05/01", "23/10/08", "01/01/09", "29/10/63", "26/03/09", "03/02/75", "03/09/04", "17/01/80", "19/03/11", "05/07/83")
)

What I want to do is calculate the age of each person, based on their date of birth, as of 1st July 2017.

To calculate age I use the following code:

df <- df %>%
mutate(age = interval(start = dmy(df$DATE_OF_BIRTH), end = dmy('01/07/17')) / 
duration(num = 1, units = "years"))

The output from this is correct for some people, but for others I get a negative value. For these people, their actual age is the absolute value of age abs(age) plus 17.

Can someone tell me how to get only positive values for age? Thanks.

I have seen the following question: Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date , but this does not include the issue with the negative ages as an output.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Laura
  • 499
  • 5
  • 13
  • 1
    If the birth date was 23rd September 1969 (23/09/69), their age at 1st July 2017 would be 57 years. Why does having someone born in these earlier years get a negative value as the output? – Laura Jan 09 '19 at 01:59
  • 2
    How do you know they weren't born on 23rd September 2069? – Hong Ooi Jan 09 '19 at 02:21

2 Answers2

4

If you check the output of dmy function

head(df$DATE_OF_BIRTH)
#[1] "20/10/01" "15/04/88" "16/12/58" "15/10/91" "09/02/66" "02/07/03"

head(dmy(df$DATE_OF_BIRTH))
#[1] "2001-10-20" "1988-04-15" "2058-12-16" "1991-10-15" "2066-02-09" "2003-07-02"

R interprets years 00 - 68 as 2000 - 2068 and 69 - 99 as 1969 - 1999. Hence, 58 is considered as 2058, 66 is considered to 2066 but 88 is 1988.

From ?strptime

%y Year without century (00–99). On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change


For negative values you can add 100 to them to get equivalent positive values

library(dplyr)
library(lubridate)

df %>%
  mutate(age = interval(start = dmy(DATE_OF_BIRTH), end = dmy('01/07/17')) / 
          duration(num = 1, units = "years"), 
          age = if_else(age < 0, age + 100, age))


#   DATE_OF_BIRTH       age
#1       20/10/01 15.706849
#2       15/04/88 29.230137
#3       16/12/58 58.512329
#4       15/10/91 25.728767
#5       09/02/66 51.356164
#6       02/07/03 14.008219
#7       20/08/96 20.876712
#....

To get difference between dates in years, you could also use interval like this

df %>%
  mutate(age = interval(dmy(DATE_OF_BIRTH), dmy('01/07/17')) / years(1),
         age = if_else(age < 0, age + 100, age))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

You would need to cleanse the data as lubridate or as.Date() will both yield similar results.

For any converted year greater than today (illogical DoB), subtract 100 years to the converted date to make it relevant. The code below contains the cleansing portion described above. Good luck with the data analysis!

library(tidyr)
library(dplyr)
library(magrittr)
library(lubridate)


library(tidyr)
library(dplyr)
library(magrittr)
library(lubridate)

df <- data.frame(
  DATE_OF_BIRTH = c("20/10/01" , "15/04/88", "16/12/58", "15/10/91", "09/02/66", "02/07/03", "20/08/96", "22/04/99", "17/04/87", "17/08/56",
                    "28/05/40", "26/07/59", "02/04/65", "17/08/93", "01/08/86", "30/07/01", "03/09/75", "17/09/65", "16/02/95", "11/06/03",
                    "26/10/64", "25/02/73", "07/02/90", "31/03/38", "05/03/83", "10/02/61", "01/07/40", "15/08/51", "19/12/75", "25/11/58",
                    "05/11/81", "05/12/02", "06/05/40", "23/09/69", "17/04/48", "02/07/58", "04/03/98", "26/11/03", "08/01/91", "23/12/07",
                    "05/05/01", "23/10/08", "01/01/09", "29/10/63", "26/03/09", "03/02/75", "03/09/04", "17/01/80", "19/03/11", "05/07/83")

)


#set the date for comparison
comparisondate<-as.Date("2017-07-01")

#Retrieve the lubridate format and clean it for incorrect conversions
df$DOBnew<-dmy(df$DATE_OF_BIRTH)
#calculate the age
df$age<-round(as.numeric(difftime(comparisondate,df$DOBnew,units="weeks")/52.25),digits=1)
df[df$age<0,"DOBnew"]<-df[df$age<0,"DOBnew"] %m-% years(100)

#recalculate age
df$age<-round(as.numeric(difftime(comparisondate,df$DOBnew,units="weeks")/52.25),digits=1)
df$age



[1] 15.7 29.2 58.5 25.7 51.3 14.0 20.8 18.2 30.2 60.8 77.0 57.9 52.2 23.8 30.9 15.9 41.8 51.7 22.3 14.0
[21] 52.6 44.3 27.4 79.1 34.3 56.3 76.9 65.8 41.5 58.5 35.6 14.6 77.0 47.7 69.1 58.9 19.3 13.6 26.4  9.5
[41] 16.1  8.7  8.5 53.6  8.3 42.3 12.8 37.4  6.3 33.9

all(df$age>0)
[1] TRUE