Another day, another new problem.

Below are the columns and rows that I have as input:

ID  Age
123 23 Years 1 Month 2 Days
125 28 Years 9 Month 14 Days
126 28 years
127 34 YEAR
128 35 Years 8 Month 21 Days
129 38 Years 5 Month 25 Days
130 32.8

I need the ages converted to decimal years in a new column, like this:

ID  Age                      Age_new
123 23 Years 1 Month 2 Days     23.1
125 28 Years 9 Month 14 Days    28.9
126 28 years                    28
127 34 YEAR                     34
128 35 Years 8 Month 21 Days    35.8
129 38 Years 5 Month 25 Days    38.5
130 32.8                        32.8

I tried the stringr package, but I only get the first part of each string, which doesn't produce the output above.

James Z
Rebel_47
  • There is no perfect way to do this: months can be 28, 29, 30, or 31 days long, so the decimal-year you're showing is an approximation. Are you assuming that a month is `30.41667` days long? What about leap years, do you expect `30.5` days per month then? How are we to know which to use? – r2evans Jan 13 '22 at 03:46
  • If the figure after the decimal is just the number of months, you might get sorting issues (eg 23 years and 11 months - 23.11 - would get sorted before 23 years and 2 months - 23.2 - if it's sorted numerically or in character order) – Hobo Jan 13 '22 at 03:52
  • What is the purpose of this? By the time I write the code and paste it here, their age in digits is already outdated. Get their date of birth. If you know the date on which the data was taken, you can easily subtract the duration since birth that you have from that date. Once you have the DOB, you can always present the age accurately. – Merijn van Tilborg Jan 13 '22 at 08:08
  • Closely related: [Transforming complete age from character to numeric in R](https://stackoverflow.com/questions/70191127/transforming-complete-age-from-character-to-numeric-in-r). To apply [my answer there](https://stackoverflow.com/a/70191534), you only need to convert to lower case and add "years" to values without unit: `library(lubridate)`; `time_length(period(sub("(^\\d+\\.?\\d*$)", "\\1 years", tolower(dat$Age))), unit = "years")`. – Henrik Jan 13 '22 at 08:43
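
The concerns raised in the comments above can be checked with a little arithmetic. This is only an illustrative sketch in base R; the variable names (`days_per_month_common`, `pasted`, `fraction`) are made up for the example and do not come from the question.

```r
# Average month length under two assumptions about the year
days_per_month_common <- 365 / 12   # about 30.41667
days_per_month_leap   <- 366 / 12   # exactly 30.5

# "years.months" labels versus true fractions of a year
# (23 years 11 months, and 23 years 2 months)
pasted   <- c(y23_m11 = 23.11, y23_m2 = 23.2)
fraction <- c(y23_m11 = 23 + 11 / 12, y23_m2 = 23 + 2 / 12)

names(sort(pasted))    # "y23_m11" "y23_m2": 11 months sorts before 2 months
names(sort(fraction))  # "y23_m2" "y23_m11": the expected order
```

Expressing months as a fraction of a year (months / 12) avoids the sorting problem, at the cost of no longer matching the `23.1`-style labels shown in the question.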

2 Answers

Here's a gross approximation:

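# Extract the number that precedes a unit name (e.g. "year"); NA when the unit is absent
func <- function(x, ptn) {
  out <- gsub(paste0(".*?\\b([0-9.]+)\\s*", ptn, ".*"), "\\1", x, ignore.case = TRUE)
  ifelse(out == x, NA, out)  # gsub returns x unchanged when nothing matched
}

library(dplyr)
dat %>%
  mutate(
    # one numeric column per unit: yr, mon, day
    data.frame(
      lapply(c(yr = "year", mon = "month", day = "day"),
             function(ptn) as.numeric(func(Age, ptn)))
    ),
    # a bare number such as "32.8" has no unit; treat it as years
    yr = if_else(is.na(yr), suppressWarnings(as.numeric(Age)), yr),
    # units that are missing count as 0
    across(c(yr, mon, day), ~ coalesce(., 0)),
    New_Age = yr + mon/12 + day/365
  )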
#    ID                      Age   yr mon day  New_Age
# 1 123  23 Years 1 Month 2 Days 23.0   1   2 23.08881
# 2 125 28 Years 9 Month 14 Days 28.0   9  14 28.78836
# 3 126                 28 years 28.0   0   0 28.00000
# 4 127                  34 YEAR 34.0   0   0 34.00000
# 5 128 35 Years 8 Month 21 Days 35.0   8  21 35.72420
# 6 129 38 Years 5 Month 25 Days 38.0   5  25 38.48516
# 7 130                     32.8 32.8   0   0 32.80000

(I offer no warranty on true accuracy.)
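
For readers without dplyr, the same approximation (`yr + mon/12 + day/365`) can be sketched in base R. This is not the code from the answer above but a variant using `regexec()`/`regmatches()`; the helper name `extract_unit` is hypothetical, and treating a bare number as years is an assumption carried over from the answer.

```r
dat <- data.frame(
  ID  = c(123L, 125L, 126L, 127L, 128L, 129L, 130L),
  Age = c("23 Years 1 Month 2 Days", "28 Years 9 Month 14 Days", "28 years",
          "34 YEAR", "35 Years 8 Month 21 Days", "38 Years 5 Month 25 Days",
          "32.8")
)

# Hypothetical helper: the number that precedes a unit name, or 0 if the unit is absent
extract_unit <- function(x, unit) {
  m <- regmatches(x, regexec(paste0("([0-9.]+)\\s*", unit), x, ignore.case = TRUE))
  vapply(m, function(hit) if (length(hit) == 2) as.numeric(hit[2]) else 0, numeric(1))
}

yr  <- extract_unit(dat$Age, "year")
mon <- extract_unit(dat$Age, "month")
day <- extract_unit(dat$Age, "day")

# Assumption: a value without any letters (e.g. "32.8") is already in years
bare <- !grepl("[A-Za-z]", dat$Age)
yr[bare] <- as.numeric(dat$Age[bare])

dat$New_Age <- yr + mon / 12 + day / 365
round(dat$New_Age, 5)
# 23.08881 28.78836 28.00000 34.00000 35.72420 38.48516 32.80000
```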


Data

dat <- structure(list(ID = c(123L, 125L, 126L, 127L, 128L, 129L, 130L), Age = c("23 Years 1 Month 2 Days", "28 Years 9 Month 14 Days", "28 years", "34 YEAR", "35 Years 8 Month 21 Days", "38 Years 5 Month 25 Days", "32.8")), class = "data.frame", row.names = c(NA, -7L))
r2evans

This is my approach. I always try to avoid regex since it's too scary for me. If your data is separated by spaces exactly as in your example, I think my code will work. I know this is not the most efficient way, but hey, it works.

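library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

dat %>% 
  mutate(space_counter = str_count(Age, " ")) %>% 
  # split on spaces; shorter strings are padded with NA on the right
  separate(Age, into = paste0("tmp_col_", 1:(max(.$space_counter) + 1)), sep = " ", fill = "right") %>% 
  # tokens 1, 3 and 5 hold the numbers
  select(ID, year = tmp_col_1, month = tmp_col_3, day = tmp_col_5) %>% 
  # as.integer() truncates, so "32.8" becomes 32; missing parts become 0
  mutate(across(c(year, month, day), ~ replace_na(as.integer(.x), 0L))) %>% 
  mutate(asdur = as.duration(years(year) + months(month) + days(day))) %>% 
  # 3.154e+7 is roughly the number of seconds in a year
  mutate(age_new = as.numeric(asdur) / 3.154e+7)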
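
The regex-free idea (split on spaces, then read the 1st, 3rd and 5th tokens) can also be sketched in base R. This is a simplified illustration rather than the pipeline above: the helper `pick` is hypothetical, and the conversion `year + month/12 + day/365` is an assumed approximation instead of a lubridate duration. Unlike integer conversion, this variant keeps fractional values such as `32.8`.

```r
age <- c("23 Years 1 Month 2 Days", "28 years", "32.8")

# Split each string on single spaces
parts <- strsplit(age, " ", fixed = TRUE)

# Hypothetical helper: i-th token as a number, or 0 when the string is too short
pick <- function(p, i) if (length(p) >= i) as.numeric(p[i]) else 0

year  <- vapply(parts, pick, numeric(1), i = 1)
month <- vapply(parts, pick, numeric(1), i = 3)
day   <- vapply(parts, pick, numeric(1), i = 5)

age_new <- year + month / 12 + day / 365
round(age_new, 5)
# 23.08881 28.00000 32.80000
```

As with the original approach, this only works when the tokens appear in exactly the order "number unit number unit number unit".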


Joe Christian