I would like to calculate age based on date of birth and reference date. However, I have a large set of data and am working with data.table.

I considered the following function, by user @MichaelChirico, available at: Efficient and accurate age calculation (in years, months, or weeks) in R given birth date and an arbitrary date.

library(data.table)
get_age <- function(birthdays, ref_dates){
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 59 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
}

The problem is that some entries for date of birth are empty, which produces the following error message:

Caused by error in `foverlaps()`:
! NA values in data.table 'x' start column: 'start'. All rows with NA values in the range columns must be removed for foverlaps() to work.

I cannot drop these observations. Instead, I would like the age to be NA whenever the date of birth is NA. Any idea how I can do this?

For example:

test <- structure(list(city = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), date = c(10101992, 
15101996, 1031997, 1061900, 13011870, 14071983, 11121995, NA, 
11121995, 29021996), reference = c(20032023, 20032023, 20032023, 
20032023, 20032023, 20032023, 20032023, 20032023, 20032023, 20032023
), date1 = structure(c(8318, 9784, 9921, -25416, -36512, 4942, 
9475, NA, 9475, 9555), class = "Date"), reference1 = structure(c(19436, 
19436, 19436, 19436, 19436, 19436, 19436, 19436, 19436, 19436
), class = "Date")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))



test$date1 <- lubridate::dmy(test$date)
test$reference1 <- lubridate::dmy(test$reference)
test$age <- get_age(test$date1, test$reference1)

This gives me the error:

Error in foverlaps(data.table(start = bdr <- bday%%1461L, end = bdr), :
NA values in data.table 'x' start column: 'start'. All rows with NA values in the range columns must be removed for foverlaps() to work.

I would like to get:

structure(list(city = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), date = c(10101992, 
15101996, 1031997, 1061900, 13011870, 14071983, 11121995, NA, 
11121995, 29021996), reference = c(20032023, 20032023, 20032023, 
20032023, 20032023, 20032023, 20032023, 20032023, 20032023, 20032023
), date1 = c("1992-10-10", "1996-10-15", "1997-03-01", "1900-06-01", 
"1870-01-13", "1983-07-14", "1995-12-11", NA, "1995-12-11", "1996-02-29"
), reference1 = c("2023-03-20", "2023-03-20", "2023-03-20", "2023-03-20", 
"2023-03-20", "2023-03-20", "2023-03-20", "2023-03-20", "2023-03-20", 
"2023-03-20"), age = c(30.441095890411, 26.427397260274, 26.051912568306, 
122.8, 153.178082191781, 39.6803278688525, 27.2704918032787, 
NA, 27.2704918032787, 27.051912568306)), row.names = c(NA, 10L
), class = "data.frame")
Velton Sousa
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 20 '23 at 22:00
  • Adjusted @MrFlick ! – Velton Sousa Mar 20 '23 at 22:15

1 Answer

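One option is to leave the NA rows out when calling get_age, then write the results back into the non-NA positions so that the remaining rows keep an age of NA.

It could look like this: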

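# Parse the raw ddmmyyyy values into Date objects
test$date1 <- lubridate::dmy(test$date)
test$reference1 <- lubridate::dmy(test$reference)
# TRUE for rows where both dates are present
id <- !is.na(test$date1) & !is.na(test$reference1)
# Compute ages only for the complete rows
res <- get_age(test$date1[id], test$reference1[id])
# Start with NA everywhere, then fill in the computed ages
test$age <- NA_real_
test$age[id] <- res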
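
The same masking idea can be wrapped in a small helper so it is reusable. This is only a sketch: `with_na_guard` and `rough_age` are hypothetical names that are not part of data.table or the linked answer, and `rough_age` is a crude stand-in for `get_age` so the example runs on its own. It also assumes both date vectors have the same length.

```r
# Hypothetical helper: call a two-argument vectorised function only on the
# positions where both inputs are non-missing; return NA everywhere else.
with_na_guard <- function(fun, birthdays, ref_dates) {
  stopifnot(length(birthdays) == length(ref_dates))
  ok <- !is.na(birthdays) & !is.na(ref_dates)
  out <- rep(NA_real_, length(birthdays))
  if (any(ok)) {
    out[ok] <- fun(birthdays[ok], ref_dates[ok])
  }
  out
}

# Stand-in for get_age (assumption): elapsed days divided by 365.25.
rough_age <- function(b, r) as.numeric(r - b) / 365.25

b <- as.Date(c("1992-10-10", NA, "1996-02-29"))
r <- as.Date(rep("2023-03-20", 3))
res <- with_na_guard(rough_age, b, r)
print(res)  # second element is NA, the others are numeric ages
```

With the real function this would be called as `with_na_guard(get_age, test$date1, test$reference1)`. If the data is already a data.table (for example after `setDT(test)`), subsetting in `i` should have the same effect: `test[!is.na(date1), age := get_age(date1, reference1)]` leaves `age` as NA for the rows that were skipped.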
Wael