I need to transform a data frame containing population information for each sampling date into a data frame with individual information to run a survival analysis. My data look like this:
library(data.table)
library(dplyr)
Place=c(rep("Europe",6))
Age=c(rep("Newborn",3),rep("Young",3))
Date_sample=as.Date(c('2014-03-18','2014-10-01','2015-01-15','2014-06-16','2014-12-21','2015-01-15'))
Number_indiv_status1=c(0,2,1,0,2,2)
Number_indiv_status2=c(10,8,7,7,5,3)
df<-data.table(Place,Age,Date_sample,Number_indiv_status1,Number_indiv_status2)
> df
    Place     Age Date_sample Number_indiv_status1 Number_indiv_status2
1: Europe Newborn  2014-03-18                    0                   10
2: Europe Newborn  2014-10-01                    2                    8
3: Europe Newborn  2015-01-15                    1                    7
4: Europe   Young  2014-06-16                    0                    7
5: Europe   Young  2014-12-21                    2                    5
6: Europe   Young  2015-01-15                    2                    3
And I need to obtain this:
> new_df
     Place     Age Date_sample Number_indiv_status1 Number_indiv_status2 Status date_event
 1: Europe Newborn  2014-10-01                    2                    8      1 2014-05-30
 2: Europe Newborn  2014-10-01                    2                    8      1 2014-08-15
 3: Europe Newborn  2015-01-15                    1                    7      1 2014-12-17
 4: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 5: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 6: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 7: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 8: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
 9: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
10: Europe Newborn  2015-01-15                    1                    7      2 2015-01-15
11: Europe   Young  2014-12-21                    2                    5      1 2014-09-01
12: Europe   Young  2014-12-21                    2                    5      1 2014-09-21
13: Europe   Young  2015-01-15                    2                    3      1 2014-12-29
14: Europe   Young  2015-01-15                    2                    3      1 2015-01-02
15: Europe   Young  2015-01-15                    2                    3      2 2015-01-15
16: Europe   Young  2015-01-15                    2                    3      2 2015-01-15
17: Europe   Young  2015-01-15                    2                    3      2 2015-01-15
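To make the target explicit: each group should get one Status 1 row per individual counted in Number_indiv_status1 at any sampling date, plus one Status 2 row per individual still counted in Number_indiv_status2 at the last sampling date. Here is a minimal base R sketch of that row count (the names `df_check`, `n_status1`, `n_status2` and `n_rows_check` are made up for this check and are not part of my real code):

```r
# Hypothetical copy of the relevant columns, base R only
df_check <- data.frame(
  Age = c(rep("Newborn", 3), rep("Young", 3)),
  Number_indiv_status1 = c(0, 2, 1, 0, 2, 2),
  Number_indiv_status2 = c(10, 8, 7, 7, 5, 3)
)

# Status 1 rows: all events counted over the sampling dates of a group
n_status1 <- tapply(df_check$Number_indiv_status1, df_check$Age, sum)

# Status 2 rows: individuals left at the last sampling date of a group
n_status2 <- tapply(df_check$Number_indiv_status2, df_check$Age,
                    function(x) x[length(x)])

# Expected number of rows per group in new_df
n_rows_check <- n_status1 + n_status2
n_rows_check
```

This gives 10 rows for Newborn and 7 for Young, i.e. the 17 rows shown above.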
I wrote the following code, which does not work:
tot_lines <- df %>% group_by(Age) %>% slice(1) %>% ungroup() %>% summarise(tot_lines=sum(Number_indiv_status2))
new_df <- data.frame(matrix(NA, nrow = tot_lines[[1]], ncol = 7))
colnames(new_df)=c(colnames(df),"Status","date_event")
k=0
for (i in 1:nrow(df)) {
  if(df[i,"Number_indiv_status1"]>0){
    for (j in 1:df[[i,"Number_indiv_status1"]]){
      new_df[k+j,c(1:5)]=df[i,c(1:5)]
      new_df[k+j,6]=1
      new_df[k+j,7]=sample(seq.POSIXt(as.POSIXct(df[[i-1,3]]), as.POSIXct(df[[i,3]]),by="day"), size = 1) #random date between df[i-1,3] and df[i,3]
      k=sum(complete.cases(new_df))
    }
  } else {
  }
  if(i==sum(df$Age=="Newborn")) {
    for (l in 1:df[i,"Number_indiv_status2"]) {
      new_df[k+l,c(1:5)]=df[l,c(1:5)]
      new_df[k+l,6]=2
      new_df[k+l,7]=df[i,3]
    } else {
    }
  }
  k=sum(complete.cases(new_df))
}
I have identified several errors/tasks in the loop that I need to solve but cannot figure out:

1. There is a `Date` issue here: `new_df[2,c(1:5)]=df[2,c(1:5)]`, which I don't understand, as `class(df$Date_sample)` returns "Date" (cf. this post). I have tried to use `new_df[1,3]=ymd(df[[2,3]])` or `new_df[1,3]=as_date(df[[2,3]])` as mentioned here, without success. I still get "16344" instead of "2014-10-01" (the matching integer, but not in the date format). Why, and how can I solve this?
2. I tried assigning a random date in the time interval following this, which does not work here:

    new_df[1,7]=sample(seq.POSIXt(as.POSIXct(df[[1,3]]), as.POSIXct(df[[2,3]]),by="day"), size = 1)

   I believe it is a matter of format, because it returns "1409443200", and `as_date(1409443200)` is not relevant ("3860894-05-31"). I also read this and this, but I would like to avoid creating a function in or before the loop. I also checked the `lubridate` package to find an elegant option, but could not figure it out. If anyone has an idea about that option, it would be great.
3. As my loop does not work, I am not sure my indexes (i, j, k and l) are coded correctly and placed in the right place.
4. Once the loop works: is there a way to insert it in a pipe (`%>%`), for example? I actually have more than one Place and more than two Age classes, so I would need to `group_by` to operate by Place and Age, but append to a single new data frame new_df.
5. Would there be a non-loop option to do the same, with the `tidyverse` for example? I try to avoid loops, but here I don't see how to manage it.

Last but not least: still new on the site, should I have asked my questions in separate posts?
Edit
1. I found a solution for point 1: setting `new_df$Date_sample <- as.Date(new_df$Date_sample)` before `k=0` and entering the loop solves the format issue for new_df. I still don't know why using `ymd()` or `as_date` in the loop does not work, though.
2. I found a way to assign a random date in the interval between two sampling times. I based my code on the Python suggestion here (first answer) to get to this:

    sample(unclass(as.Date(df[[i,3]]))-unclass(as.Date(df[[i-1,3]])),1)+df[[i-1,3]]

   It also requires setting `new_df$date_event <- as.Date(new_df$date_event)` before `k=0` and the loop; otherwise, as before, the result is right but not in the date format.
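Here is a minimal, self-contained version of what I ended up with, in base R only. The names `out`, `d1`, `d2` and `n_days` are hypothetical and only serve this example; my guess is that the columns of a data frame built from `matrix(NA, ...)` start as logical, which is why they have to be declared as Date first.

```r
set.seed(1)  # only to make the example reproducible

d1 <- as.Date("2014-03-18")  # previous sampling date
d2 <- as.Date("2014-10-01")  # current sampling date

# Empty result built the same way as new_df: the column is logical, not Date
out <- data.frame(matrix(NA, nrow = 2, ncol = 1))
colnames(out) <- "date_event"
class(out$date_event)  # "logical"

# Declare the column as Date before filling it
out$date_event <- as.Date(out$date_event)

# Random date after d1 and up to d2: add a random number of days to d1
n_days <- as.integer(d2 - d1)
out[1, "date_event"] <- d1 + sample(n_days, 1)

class(out$date_event)  # "Date"
out
```

Note that `sample(n_days, 1)` draws from 1 to `n_days`, so the event date can equal the current sampling date but never the previous one.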
I am still working on the other errors; they remain unsolved.