
Here is a simplified version of what my data set looks like:

 > df
   ID total_sleep sleep_end_date
1   1           9     2017-09-03
2   1           8     2017-09-04
3   1           7     2017-09-05
4   1          10     2017-09-06
5   1          11     2017-09-07
6   2           5     2017-09-03
7   2          12     2017-09-04
8   2           4     2017-09-05
9   2           3     2017-09-06
10  2           6     2017-09-08

Where total_sleep is expressed in hours.

What I am trying to find is the absolute difference in hours of sleep between every two consecutive dates, for each user ID. The desired output should look something like this:

> df_answer

   ID total_sleep sleep_end_date      diff_hours_of_sleep
1   1           9     2017-09-03                       NA
2   1           8     2017-09-04                        1
3   1           7     2017-09-05                        1
4   1          10     2017-09-06                        3
5   1          11     2017-09-07                        1
6   2           5     2017-09-03                       NA
7   2          12     2017-09-04                        7
8   2           4     2017-09-05                        8
9   2           3     2017-09-06                        1
10  2           6     2017-09-08                       NA

NA appears in rows 1 and 6 because those are the first rows for each ID, so there is no data for the day before.

Most importantly, NA appears in row 10 because I don't have any data concerning the previous day (2017-09-07). And this has been the trickiest part to code for me.

I've googled (meaning: "stackoverflowed") this and tried to find a solution using the "data wrangling cheatsheet" for dplyr, but I haven't been able to find a function that lets me do what I want while taking both variables into account: the date and the user ID.

I am a beginner in R, so I might indeed be missing something simple. Any input or suggestion would be very welcome!

nespereira

3 Answers

## Order data.frame by IDs, then by increasing sleep_end_dates (if not already sorted)
df <- df[order(df$ID, df$sleep_end_date),]

## Calculate difference in total_sleep with previous entry
df$diff_hours_of_sleep <- c(NA,abs(diff(df$total_sleep)))

## If previous ID is not equal, replace diff_hours_of_sleep with NA
ind <- c(NA, diff(df$ID))
df$diff_hours_of_sleep[ind != 0] <- NA

## And if previous day wasn't yesterday, replace diff_hours_of_sleep with NA
day_ind <- c(NA, diff(df$sleep_end_date))
df$diff_hours_of_sleep[day_ind != 1] <- NA
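
To try the steps above end to end, here is a minimal, self-contained sketch. It assumes sleep_end_date is stored as a Date (not character or factor), and it uses 2017-09-08 for row 10, as in the desired output, so that the one-day gap is exercised. The as.numeric() call on the dates is my own addition to keep the comparison in plain day counts.

```r
# Sample data from the question; row 10 is dated 2017-09-08 (assumption taken
# from the desired output) so the last row follows a missing day.
df <- data.frame(
  ID = rep(1:2, each = 5),
  total_sleep = c(9, 8, 7, 10, 11, 5, 12, 4, 3, 6),
  sleep_end_date = as.Date(c(
    "2017-09-03", "2017-09-04", "2017-09-05", "2017-09-06", "2017-09-07",
    "2017-09-03", "2017-09-04", "2017-09-05", "2017-09-06", "2017-09-08"
  ))
)

# Sort by ID, then by date
df <- df[order(df$ID, df$sleep_end_date), ]

# Absolute difference with the previous row
df$diff_hours_of_sleep <- c(NA, abs(diff(df$total_sleep)))

# Blank out rows where the previous row belongs to another ID
df$diff_hours_of_sleep[c(NA, diff(df$ID)) != 0] <- NA

# Blank out rows where the previous row is not exactly one day earlier
df$diff_hours_of_sleep[c(NA, diff(as.numeric(df$sleep_end_date))) != 1] <- NA

print(df)
```

Rows 1, 6 and 10 should come out as NA: the first two because they start a new ID, the last because 2017-09-07 is missing for ID 2.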
dvantwisk

Maybe the following will do it.

df <- lapply(split(df, df$ID), function(x){
        y <- ifelse(diff(x$sleep_end_date) == 1, abs(diff(x$total_sleep)), NA)
        x$diff_hours_of_sleep <- c(NA, y)
        x
})
df <- do.call(rbind, df)
df
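
As a rough illustration of the same split/lapply idea on the question's sample data, the sketch below assumes sleep_end_date is a Date and that row 10 is dated 2017-09-08 (as in the desired output). The per-group sort and the row-name reset are my own additions, not part of the code above; do.call(rbind, ...) on a named list prefixes the row names with the group name, so resetting them gives back plain 1..n numbering.

```r
# Sample data; the date of row 10 is an assumption taken from the desired output
df <- data.frame(
  ID = rep(1:2, each = 5),
  total_sleep = c(9, 8, 7, 10, 11, 5, 12, 4, 3, 6),
  sleep_end_date = as.Date(c(
    "2017-09-03", "2017-09-04", "2017-09-05", "2017-09-06", "2017-09-07",
    "2017-09-03", "2017-09-04", "2017-09-05", "2017-09-06", "2017-09-08"
  ))
)

pieces <- lapply(split(df, df$ID), function(x) {
  # Make sure each user's rows are in date order
  x <- x[order(x$sleep_end_date), ]
  # Gap in days between consecutive records of this user
  gap <- as.numeric(diff(x$sleep_end_date))
  # Keep the absolute difference only when the records are one day apart
  x$diff_hours_of_sleep <- c(NA, ifelse(gap == 1, abs(diff(x$total_sleep)), NA))
  x
})

res <- do.call(rbind, pieces)
rownames(res) <- NULL  # drop the group-prefixed row names added by rbind
print(res)
```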
Rui Barradas

Here is a solution using data.table:

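library(data.table)

## The key sorts by ID, then by date, so row i - 1 is the previous record
dt1 <- data.table(df, key=c('ID', 'sleep_end_date'))
## delta is signed (previous minus current); wrap it in abs() for the
## absolute difference. Gaps between dates are not checked here.
merge(
  dt1[,.(ID, total_sleep, sleep_end_date, i=.I - 1)],
  dt1[,.(ID, total_sleep, i=.I)], by=c('ID','i'), all.x=TRUE)[,.(ID, sleep_end_date,
  total_sleep.x, delta=total_sleep.y-total_sleep.x)]
    ID sleep_end_date total_sleep.x delta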
 1:  1     2017-09-03             9    NA
 2:  1     2017-09-04             8     1
 3:  1     2017-09-05             7     1
 4:  1     2017-09-06            10    -3
 5:  1     2017-09-07            11    -1
 6:  2     2017-09-03             5    NA
 7:  2     2017-09-04            12    -7
 8:  2     2017-09-05             4     8
 9:  2     2017-09-06             3     1
10:  2     2017-09-07             6    -3

I'm not sure how the performance compares to the pure data.frame approach, but it does appear to scale well: with the input set extended to 20,000 rows, this took under one second on my system.
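
For a comparison point, the pure data.frame approach could be timed on an input of the same size with something like the sketch below. The synthetic data (100 hypothetical users with 200 consecutive days each, random sleep hours) is an assumption made up for illustration, and timings will vary by machine.

```r
set.seed(1)
n_id <- 100   # hypothetical number of users
n_day <- 200  # hypothetical number of consecutive days per user

# 20,000 rows of synthetic data
big <- data.frame(
  ID = rep(seq_len(n_id), each = n_day),
  total_sleep = sample(3:12, n_id * n_day, replace = TRUE),
  sleep_end_date = rep(as.Date("2017-01-01") + seq_len(n_day) - 1, times = n_id)
)

timing <- system.time({
  big <- big[order(big$ID, big$sleep_end_date), ]
  d <- c(NA, abs(diff(big$total_sleep)))
  d[c(NA, diff(big$ID)) != 0] <- NA                          # new user
  d[c(NA, diff(as.numeric(big$sleep_end_date))) != 1] <- NA  # missing day
  big$diff_hours_of_sleep <- d
})

print(timing)
```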

gcbenison