Survival analysis: how to transform a population data frame into a data frame with the tidyverse / without a loop in R?

Question

I need to transform a data frame containing population information for each sampling date into a data frame with individual information to run a survival analysis. My data look like this:

library(data.table)  # data.table()
library(dplyr)       # %>%, group_by(), slice(), summarise()

Place=c(rep("Europe",6))
Age=c(rep("Newborn",3),rep("Young",3))
Date_sample=as.Date(c('2014-03-18','2014-10-01','2015-01-15','2014-06-16','2014-12-21','2015-01-15'))
Number_indiv_status1=c(0,2,1,0,2,2)
Number_indiv_status2=c(10,8,7,7,5,3)
df<-data.table(Place,Age,Date_sample,Number_indiv_status1,Number_indiv_status2)

> df
    Place     Age Date_sample Number_indiv_status1 Number_indiv_status2
1: Europe Newborn  2014-03-18                    0                   10
2: Europe Newborn  2014-10-01                    2                    8
3: Europe Newborn  2015-01-15                    1                    7
4: Europe   Young  2014-06-16                    0                    7
5: Europe   Young  2014-12-21                    2                    5
6: Europe   Young  2015-01-15                    2                    3

And I need to obtain this:

> new_df
     Place     Age Date_sample Number_indiv_status1 Number_indiv_status2 Status date_event
 1: Europe Newborn  2014-10-01                    2                    8      1 2014-05-30
 2: Europe Newborn  2014-10-01                    2                    8      1 2014-08-15
 3: Europe Newborn  2015-01-15                    1                    7      1 2014-12-17
 4: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 5: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 6: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 7: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 8: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 9: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
10: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
11: Europe   Young  2014-12-21                    2                    5      1 2014-09-01
12: Europe   Young  2014-12-21                    2                    5      1 2014-09-21
13: Europe   Young  2015-01-15                    2                    3      1 2014-12-29
14: Europe   Young  2015-01-15                    2                    3      1 2015-01-02
15: Europe   Young  2015-01-15                    2                    3      2 2015-01-15
16: Europe   Young  2015-01-15                    2                    3      2 2015-01-15
17: Europe   Young  2015-01-15                    2                    3      2 2015-01-15

I wrote the following code, which does not work:

tot_lines <- df %>% group_by(Age) %>%  slice(1) %>% ungroup() %>% summarise(tot_lines=sum(Number_indiv_status2))
new_df <- data.frame(matrix(NA, nrow = tot_lines[[1]], ncol = 7))
colnames(new_df)=c(colnames(df),"Status","date_event")
k=0
for (i in 1:nrow(df)) {
  if(df[i,"Number_indiv_status1"]>0){
    for (j in 1:df[[i,"Number_indiv_status1"]]){
      new_df[k+j,c(1:5)]=df[i,c(1:5)]
      new_df[k+j,6]=1
      new_df[k+j,7]=sample(seq.POSIXt(as.POSIXct(df[[i-1,3]]), as.POSIXct(df[[i,3]]),by="day"), size = 1)   #random date between df[i,3] and df[i+1,3]
      k=sum(complete.cases(new_df))    
      }
    } else {
    }
  if(i==sum(df$Age=="Newborn")) {
    for (l in 1:df[i,"Number_indiv_status2"]) {
      new_df[k+l,c(1:5)]=df[l,c(1:5)]
      new_df[k+l,6]=2
      new_df[k+l,7]=df[i,3]
    } else {
    }
  }
  k=sum(complete.cases(new_df))
}

I have identified several errors/tasks in the loop that I need to solve but cannot figure out:

1. There is a Date issue here: new_df[2,c(1:5)]=df[2,c(1:5)], which I don't understand, as class(df$Date_sample) returns "Date" (cf. this post). I have tried new_df[1,3]=ymd(df[[2,3]]) and new_df[1,3]=as_date(df[[2,3]]), as mentioned here, without success. I still get "16344" instead of "2014-10-01" (the matching integer, but not in date format). Why, and how can I solve this?
2. I tried assigning a random date in the time interval following this, which does not work here: new_df[1,7]=sample(seq.POSIXt(as.POSIXct(df[[1,3]]), as.POSIXct(df[[2,3]]),by="day"), size = 1). I believe it is a matter of format, because it returns "1409443200", and as_date(1409443200) is not meaningful ("3860894-05-31"). I also read this and this, but I would like to avoid creating a function in or before the loop. I also checked the lubridate package for an elegant option, but could not figure it out. If anyone has an idea for that option, it would be great.
3. As my loop does not work, I am not sure my indexes (i, j, k and l) are coded correctly and placed in the right place.
4. Once the loop works: is there a way to insert it in a pipe (%>%), for example? I actually have more than one Place and more than two Age classes, so I would need to group_by Place and Age for the operation, but append the results to a single new data frame new_df.
5. Would there be a non-loop option to do the same, with the tidyverse for example? I try to avoid loops, but here I don't see how to manage it.
6. Last but not least: I am still new on the site; should I have asked my questions in separate posts?

Edit

I found a solution for point 1: setting new_df$Date_sample <- as.Date(new_df$Date_sample) before k=0 and before entering the loop solves the format issue for new_df. I still don't know why using ymd() or as_date() in the loop does not work, though.
For point 2, I found a way to assign a random date in the interval between two sampling times. I based my code on the Python suggestion here (first answer) to get to this: sample(unclass(as.Date(df[[i,3]]))-unclass(as.Date(df[[i-1,3]])),1)+df[[i-1,3]]. It also requires setting new_df$date_event <- as.Date(new_df$date_event) before k=0 and the loop; otherwise, as before, the result is right but not in date format.
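
A minimal sketch of what is probably going on in points 1 and 2, assuming the columns of new_df start out as plain logical NA (as they do when it is built from matrix(NA, ...)). ymd() and as_date() do return a Date, but the class is dropped when the value is written into a column that is not already a Date, so only the underlying number of days is stored. Likewise, a POSIXct value is stored as seconds since 1970, which as_date() then reads as days. The names target, untyped, typed, d_prev and d_curr are hypothetical and only serve the illustration; the same coercion is what the question reports for new_df[i, 7] <- ....

```r
# Hypothetical two-row target: one column left as logical NA, one pre-typed as Date.
target <- data.frame(untyped = rep(NA, 2), typed = as.Date(rep(NA, 2)))

d_prev <- as.Date("2014-03-18")
d_curr <- as.Date("2014-10-01")

target$untyped[1] <- d_curr  # logical column: Date class is dropped, 16344 is stored
target$typed[1]   <- d_curr  # Date column: the class is preserved

# Random day after d_prev, up to and including d_curr, without going through POSIXct.
gap <- as.integer(d_curr - d_prev)          # number of days between the two samples
random_date <- d_prev + sample.int(gap, 1)  # Date + integer is still a Date
```

Pre-typing the column once, before the loop, is the same fix as the as.Date() calls described in the edit above.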

I am still working on the other errors; they remain unsolved.

Composing small functions would probably be way easier here than nested loops. I would start by writing a function that accepts arguments for place, age group, sample date, number of people with status 1, number of people with status 2. The function would return a data frame with individual level data given those arguments. That would be a good start. — Bill O'Brien, Aug 17 '21 at 14:35
@BillO'Brien I am really not familiar with writing functions, so I would not really know where to start. Also, for 1 and 2, I believe the same issues would still occur in a function because it is a formatting issue. — Mata, Aug 17 '21 at 14:50
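
As a possible starting point for the function suggested in the comment above, here is a hedged sketch. The function name expand_sample and its arguments (including the extra prev_date and is_last, which the comment does not mention) are assumptions made for illustration, not something from the thread. It expands one sampling row into individual-level rows, drawing status 1 dates uniformly from the days after the previous sample.

```r
# Hypothetical helper: turns one sampling row into one row per individual.
expand_sample <- function(place, age, prev_date, date_sample,
                          n_status1, n_status2, is_last) {
  # Status 1: random day after prev_date, up to and including date_sample.
  event_dates <- if (n_status1 > 0) {
    gap <- as.integer(date_sample - prev_date)
    prev_date + sample.int(gap, n_status1, replace = TRUE)
  } else {
    as.Date(character(0))
  }
  # Status 2: only reported on the last sampling date of a group.
  n_cens <- if (is_last) n_status2 else 0
  n_total <- n_status1 + n_cens
  data.frame(
    Place = rep(place, n_total),
    Age = rep(age, n_total),
    Date_sample = rep(date_sample, n_total),
    Status = rep(c(1, 2), c(n_status1, n_cens)),
    date_event = c(event_dates, rep(date_sample, n_cens))
  )
}

# Last Newborn sample from the question: 1 individual in status 1, 7 in status 2.
res <- expand_sample("Europe", "Newborn", as.Date("2014-10-01"),
                     as.Date("2015-01-15"), 1, 7, TRUE)
```

Because every column is created with its final type inside the function, the Date formatting problem from points 1 and 2 does not arise here.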

score 0 · Accepted Answer · answered Aug 19 '21 at 15:05

I got the loop to work, which solves points 1-3. In the data frame, I needed to encode Age as a factor: Age=as_factor(c(rep("Newborn",3),rep("Young",3)))

Then, this does the job:

k=0          # number of rows of new_df filled so far
Age_fact=1   # index of the Age level currently being processed
for (i in 1:nrow(df)) {
  # Status 1: one row per individual, with a random date after the previous sample
  if (df[i,"Number_indiv_status1"]>0) {
    for (j in 1:df[[i,"Number_indiv_status1"]]) {
      new_df[k+j,c(1:5)]=df[i,c(1:5)]
      new_df[k+j,6]=1
      new_df[k+j,7]=sample(unclass(as.Date(df[[i,3]]))-unclass(as.Date(df[[i-1,3]])),1)+df[[i-1,3]]
    }
    k=sum(complete.cases(new_df))
  }
  # Last sampling date of the current Age level: add the status 2 individuals
  if (i==tail(which(df$Age == levels(df$Age)[Age_fact]),1)) {
    for (l in 1:df[[i,"Number_indiv_status2"]]) {
      new_df[k+l,c(1:5)]=df[i,c(1:5)]
      new_df[k+l,6]=2
      new_df[k+l,7]=df[i,3]
    }
    k=sum(complete.cases(new_df))
    Age_fact=Age_fact+1   # move on to the next Age level
  }
}

One limitation, though: Age now appears in new_df as the factor index (1 or 2) instead of the name of the level, and setting new_df$Age <- as.factor(new_df$Age) before the loop does not solve it. I can still change it afterwards, but as my data set is much larger than this, it would be great to get the copy to work as a factor.
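
A hedged sketch of the factor issue, assuming it has the same cause as the Date one: writing a factor into a column that is still logical NA keeps only the integer code. Converting an all-NA column with as.factor() likely does not help because the result has no levels to match against. Giving the target column the same levels as the source before the loop is one option; mapping the codes back afterwards is another. The names ages, target, untyped, typed and recovered are hypothetical.

```r
ages <- factor(c("Newborn", "Young"))

# Hypothetical two-row target: one untyped column, one factor column with the right levels.
target <- data.frame(untyped = rep(NA, 2))
target$typed <- factor(rep(NA, 2), levels = levels(ages))

target$untyped[1] <- ages[2]  # stores the integer code 2, the label is lost
target$typed[1]   <- ages[2]  # stores the level "Young"

# Alternative after the loop: map the stored codes back to their labels.
recovered <- levels(ages)[target$untyped]
```

Storing Age as character in df instead of as a factor would avoid the problem as well, at the cost of rewriting the level-based test in the loop.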

I still have this question: is there a way to do this without a loop, with the tidyverse?
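
One possible loop-free approach, sketched with dplyr and tidyr under a few assumptions: the data are held in a tibble rather than a data.table, status 1 dates are drawn uniformly from the days after the previous sample up to and including the current one, and status 1 counts on the first sampling date of a group (which have no previous date) are dropped. The intermediate names steps, events and censored are hypothetical. tidyr::uncount() repeats each row according to a count column, which replaces the inner loops.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Place = "Europe",
  Age = rep(c("Newborn", "Young"), each = 3),
  Date_sample = as.Date(c("2014-03-18", "2014-10-01", "2015-01-15",
                          "2014-06-16", "2014-12-21", "2015-01-15")),
  Number_indiv_status1 = c(0, 2, 1, 0, 2, 2),
  Number_indiv_status2 = c(10, 8, 7, 7, 5, 3)
)

set.seed(42)  # only to make the random dates reproducible

# Per Place/Age group: previous sampling date and a flag for the last sample.
steps <- df %>%
  group_by(Place, Age) %>%
  arrange(Date_sample, .by_group = TRUE) %>%
  mutate(prev_date = lag(Date_sample), is_last = row_number() == n()) %>%
  ungroup()

# Status 1: one row per individual, with a random date in (prev_date, Date_sample].
events <- steps %>%
  filter(!is.na(prev_date)) %>%
  uncount(Number_indiv_status1, .remove = FALSE) %>%
  mutate(
    Status = 1,
    date_event = prev_date + ceiling(runif(n()) * as.numeric(Date_sample - prev_date))
  )

# Status 2: individuals still counted at the last sampling date of each group.
censored <- steps %>%
  filter(is_last) %>%
  uncount(Number_indiv_status2, .remove = FALSE) %>%
  mutate(Status = 2, date_event = Date_sample)

new_df <- bind_rows(events, censored) %>%
  arrange(Place, Age, Date_sample, Status) %>%
  select(-prev_date, -is_last)
```

Because the grouping is done with group_by(Place, Age), the same pipeline should extend to several places and more than two age classes without changes, and Age keeps its labels since nothing is copied cell by cell.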
