How do I filter a first event per day per group, select a variable in another column based on a condition, and run calculations on values inbetween?

Question

In R, I have a data frame with the columns id (representing study participants), phase, time, glucose, steps, and kiloCalories. id and phase are factors, time is POSIXcT and includes date + time, glucose (sampled every ~15 minutes) steps (sampled every minute), and kiloCalories (sampled irregularly, represents an eaten meal) are numeric. Glucose and kiloCalories data is much less frequently sampled than steps, so it contains lots of NAs.

I would like to filter this data frame in the following ways:

Retrieve the rows with the first meal of the day of each participant (id), and their glucose reading 2 hours (+-15 minutes) before that meal.
Retrieve the rows with each meal (i.e. each kiloCalories entry) of each participant (id), along with the glucose reading 2 hours (+-15 minutes) after the meal.
From task 2, take the subset of data in between meal and glucose reading, and calculate the sum of steps within that time.

The reason I specify 2 hours (+-15 minutes) is because there is a very low probability that the data frame has a glucose reading exactly 2 hours after a meal is eaten, thus I want to extend the timeframe

I've tried this StackOverflow thread on how to subset based on time and condition, but to no avail, leaving me stuck at my first task. And that thread does not talk about the complex subsetting I'd like to perform.

Edit - Here is some sample data which meets the criteria of the tasks:

sampleData <- structure(list(id = c(13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 
13, 13, 13, 13), phase = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), time = structure(c(1450881900, 
1450881960, 1450882020, 1450882080, 1450882140, 1450882200, 1450882260, 
1450882320, 1450882380, 1450882440, 1450882500, 1450882560, 1450882620, 
1450882680, 1450882740, 1450882800, 1450882860, 1450882920, 1450882980, 
1450883040, 1450883100, 1450883160, 1450883220, 1450883280, 1450883340, 
1450883400, 1450883460, 1450883520, 1450883580, 1450883640, 1450883700, 
1450883760, 1450883820, 1450883880, 1450883940, 1450884000, 1450884060, 
1450884120, 1450884180, 1450884240, 1450884300, 1450884360, 1450884420, 
1450884480, 1450884540, 1450884600, 1450884660, 1450884720, 1450884780, 
1450884840, 1450884900, 1450884960, 1450885020, 1450885080, 1450885140, 
1450885200, 1450885260, 1450885320, 1450885380, 1450885440, 1450885500, 
1450885560, 1450885620, 1450885680, 1450885740, 1450885800, 1450885860, 
1450885920, 1450885980, 1450886040, 1450886100, 1450886160, 1450886220, 
1450886280, 1450886340, 1450886400, 1450886460, 1450886520, 1450886580, 
1450886640, 1450886700, 1450886760, 1450886820, 1450886880, 1450886940, 
1450887000, 1450887060, 1450887120, 1450887180, 1450887240, 1450887300, 
1450887360, 1450887420, 1450887480, 1450887540, 1450887600, 1450887660, 
1450887720, 1450887780, 1450887840, 1450887900, 1450887960, 1450888020, 
1450888080, 1450888140, 1450888200, 1450888260, 1450888320, 1450888380, 
1450888440, 1450888500, 1450888560, 1450888620, 1450888680, 1450888740, 
1450888800, 1450888860, 1450888920, 1450888980, 1450889040, 1450889100, 
1450889160, 1450889220, 1450889280, 1450889340, 1450889400, 1450889460, 
1450889520, 1450889580, 1450889640, 1450889700, 1450889760, 1450889820, 
1450889880, 1450889940, 1450890000, 1450890060, 1450890120, 1450890180, 
1450890240, 1450890300, 1450890360, 1450890420, 1450890480, 1450890540, 
1450890600, 1450890660, 1450890720, 1450890780, 1450890840, 1450890900, 
1450890960, 1450891020, 1450891080, 1450891140, 1450891200, 1450891260, 
1450891320, 1450891380, 1450891440, 1450891500, 1450891560, 1450891620, 
1450891680, 1450891740, 1450891800, 1450891860, 1450891920, 1450891980, 
1450892040, 1450892100, 1450892160, 1450892220, 1450892280, 1450892340, 
1450892400, 1450892460, 1450892520, 1450892580, 1450892640, 1450892700, 
1450892760, 1450892820, 1450892880, 1450892940, 1450893000, 1450893060, 
1450893120, 1450893180, 1450893240, 1450893300, 1450893360, 1450893420, 
1450893480, 1450893540, 1450893600, 1450893660, 1450893720, 1450893780, 
1450893840, 1450893900, 1450893960, 1450894020, 1450894080, 1450894140, 
1450894140, 1450894200, 1450894260, 1450894320, 1450894380, 1450894440, 
1450894500, 1450894560, 1450894620, 1450894680, 1450894740, 1450894800, 
1450894860, 1450894920, 1450894980, 1450895040, 1450895100, 1450895160, 
1450895220, 1450895280, 1450895340, 1450895400, 1450895460, 1450895520, 
1450895580, 1450895640, 1450895700, 1450895760, 1450895820, 1450895880, 
1450895940, 1450896000, 1450896060, 1450896120, 1450896180, 1450896240, 
1450896300, 1450896360, 1450896420, 1450896480, 1450896540, 1450896600, 
1450896660, 1450896720, 1450896780, 1450896840, 1450896900, 1450896960, 
1450897020, 1450897080, 1450897140, 1450897200, 1450897260, 1450897320, 
1450897380, 1450897440, 1450897500, 1450897560, 1450897620, 1450897680, 
1450897740, 1450897800, 1450897860, 1450897920, 1450897980, 1450898040, 
1450898100, 1450898160, 1450898220, 1450898280, 1450898340, 1450898400, 
1450898460, 1450898520, 1450898580, 1450898640, 1450898700, 1450898760, 
1450898820, 1450898880, 1450898940, 1450899000, 1450899060, 1450899120, 
1450899180, 1450899240, 1450899300, 1450899360, 1450899420, 1450899480, 
1450899540, 1450899600, 1450899660, 1450899720, 1450899780, 1450899840, 
1450899900), class = c("POSIXct", "POSIXt")), gl = c(NA, NA, 
NA, NA, NA, NA, NA, NA, 84, NA, NA, NA, NA, 83, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 81, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 82, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 84, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 83, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 79, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
76, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 78, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 93, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 116, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 128, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 141, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 142, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 146, 
143, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
136, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
129, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
134, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
139, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
134, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
128, NA, NA, NA, NA, NA, NA), steps = c(24, 39, 28, 19, 29, 6, 
12, 3, 13, 1, 6, 2, 1, 13, 10, 1, 1, 1, 1, 0, 0, 1, 1, 3, 1, 
0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 0, 3, 33, 27, 17, 27, 
30, 19, 23, 34, 38, 25, 30, 42, 31, 31, 16, 52, 91, 39, 23, 7, 
6, 27, 64, 20, 53, 22, 14, 14, 5, 4, 13, 7, 13, 7, 8, 10, 14, 
26, 25, 19, 23, 35, 23, 15, 13, 12, 11, 27, 21, 25, 27, 4, 8, 
18, 15, 22, 30, 16, 15, 15, 5, 3, 4, 6, 0, 12, 10, 4, 3, 5, 2, 
5, 10, 13, 7, 2, 6, 2, 1, 15, 23, 25, 18, 27, 5, 11, 22, 31, 
17, 27, 19, 2, 0, 12, 3, 0, 5, 5, 0, 0, 1, 0, 2, 2, 2, 5, 4, 
4, 1, 7, 2, 5, 4, 8, 2, 4, 0, 4, 6, 8, 11, 10, 22, 2, 1, 0, 4, 
4, 2, 2, 9, 19, 8, 11, 7, 7, 4, 0, 1, 0, 2, 3, 13, 9, 0, 3, 4, 
5, 5, 7, 5, 5, 8, 8, 26, 23, 26, 27, 24, 24, 13, 25, 17, 24, 
24, 11, 16, 15, 25, 21, 18, 11, 16, 19, 2, 0, 7, 6, 6, 3, 1, 
13, 13, 0, 1, 10, 12, 10, 9, 7, 1, 1, 12, 4, 0, 0, 0, 5, 2, 5, 
2, 1, 2, 0, 1, 2, 5, 11, 0, 0, 2, 1, 0, 2, 0, 7, 1, 0, 0, 0, 
0, 1, 0, 3, 1, 0, 1, 0, 0, 3, 10, 13, 1, 8, 4, 1, 0, 0, 1, 0, 
23, 22, 11, 16, 16, 5, 5, 5, 3, 14, 2, 0, 0, 0, 1, 2, 0, 1, 2, 
3, 1), kiloCalories = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 603, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 143, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -302L), class = c("tbl_df", 
"tbl", "data.frame"))

Welcome. Please share your data using ``dput()``. Thanks. – user438383 Aug 21 '21 at 14:04 — user438383, Aug 21 '21 at 14:04
Thank you, @user438383 I just added some data. – TIV Aug 21 '21 at 15:08 — TIV, Aug 21 '21 at 15:08

Ben · Accepted Answer · 2021-08-22T16:15:26.267

I believe there may be a number of considerations on how you want to organize your data, depending on how you intend to analyze further. However, here are some ideas that may be helpful for you.

This solution uses tidyverse and fuzzyjoin as you tagged with dplyr - but you may want to consider a data.table or sqldf solution as alternatives, depending on size of data, speed needed, and other factors.

First, I would create a table that includes the meals based on kiloCalories values that are not missing. We will create a meal column and enumerate meals for each date. In addition, we can calculate your windows for preprandial and postprandial glucose levels.

library(tidyverse)
library(fuzzyjoin)

mealsData <- sampleData %>%
  filter(!is.na(kiloCalories)) %>%
  group_by(id, date = date(time)) %>%
  mutate(meal = 1:n(),
         preprandial_1 = time - (60 * 60 * 2) - (15 * 60),
         preprandial_2 = time - (60 * 60 * 2) + (15 * 60),
         postprandial_1 = time + (60 * 60 * 2) - (15 * 60),
         postprandial_2 = time + (60 * 60 * 2) + (15 * 60)) %>%
  select(-gl, -steps, -kiloCalories)

The result is this for mealsData:

     id phase time                date        meal preprandial_1       preprandial_2       postprandial_1      postprandial_2     
  <dbl> <dbl> <dttm>              <date>     <int> <dttm>              <dttm>              <dttm>              <dttm>             
1    13     1 2015-12-23 12:00:00 2015-12-23     1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00
2    13     1 2015-12-23 13:30:00 2015-12-23     2 2015-12-23 11:15:00 2015-12-23 11:45:00 2015-12-23 15:15:00 2015-12-23 15:45:00

I have found such tables to be very useful as reference.

Next, you can merge this table with your sampleData. For task 1, you want preprandial first meal glucose levels. So, you can use fuzzy_join and ensure the times are between the calculated preprandial times determined.

fuzzy_inner_join(
  mealsData %>% filter(meal == 1),
  sampleData %>% filter(!is.na(gl)),
  by = c("id", "phase", "preprandial_1" = "time", "preprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
)

The result is:

   id.x phase.x time.x              date        meal preprandial_1       preprandial_2       postprandial_1      postprandial_2       id.y phase.y time.y                 gl steps kiloCalories
  <dbl>   <dbl> <dttm>              <date>     <int> <dttm>              <dttm>              <dttm>              <dttm>              <dbl>   <dbl> <dttm>              <dbl> <dbl>        <dbl>
1    13       1 2015-12-23 12:00:00 2015-12-23     1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00    13       1 2015-12-23 09:53:00    84    13           NA
2    13       1 2015-12-23 12:00:00 2015-12-23     1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00    13       1 2015-12-23 09:58:00    83    13           NA
3    13       1 2015-12-23 12:00:00 2015-12-23     1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00    13       1 2015-12-23 10:08:00    81     3           NA

It appears there are 3 glucose levels that fall within that window from the sample data.

Next, you can do something similar for postprandial data, for all meals:

fuzzy_inner_join(
  mealsData,
  sampleData %>% filter(!is.na(gl)),
  by = c("id", "phase", "postprandial_1" = "time", "postprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
)

The result is:

   id.x phase.x time.x              date        meal preprandial_1       preprandial_2       postprandial_1      postprandial_2       id.y phase.y time.y                 gl steps kiloCalories
  <dbl>   <dbl> <dttm>              <date>     <int> <dttm>              <dttm>              <dttm>              <dttm>              <dbl>   <dbl> <dttm>              <dbl> <dbl>        <dbl>
1    13       1 2015-12-23 12:00:00 2015-12-23     1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00    13       1 2015-12-23 13:54:00   134     0           NA
2    13       1 2015-12-23 12:00:00 2015-12-23     1 2015-12-23 09:45:00 2015-12-23 10:15:00 2015-12-23 13:45:00 2015-12-23 14:15:00    13       1 2015-12-23 14:09:00   139     1           NA

Here there are two glucose levels postprandial found.

Finally, you can merge the data.frames and then group by the id (id.x used since the join created a duplicate), the meal and the date. Then you can sum up the steps:

fuzzy_inner_join(
  mealsData,
  sampleData,
  by = c("id", "phase", "time" = "time", "postprandial_2" = "time"),
  match_fun = c(`==`, `==`, `<=`, `>=`)
) %>%
  group_by(id.x, meal, date) %>%
  summarise(step_sum = sum(steps))

The result is:

   id.x  meal date       step_sum
  <dbl> <int> <date>        <dbl>
1    13     1 2015-12-23      876
2    13     2 2015-12-23      294

Edit 1: You might also try using data.table for a faster solution. Using setDT will make the data.frame a data.table:

library(data.table)

setDT(mealsData)
setDT(sampleData)

Then, you can do a nonequi join between your sampleData and mealsData. This statement includes which columns you want to include in the result, and merging based on times. The nomatch will leave out results where there is no match (for example, no post-prandial glucose levels for second meal).

sampleData[!is.na(gl)][
  mealsData,
  .(id, phase, gl, x.time),
  on = .(id, phase, time >= postprandial_1, time <= postprandial_2),
  nomatch = 0]

To get the sum of steps, you can try:

sampleData[mealsData, 
           .(id, phase, meal, date, steps), 
           on = .(id, phase, time >= time, time <= postprandial_2), 
           nomatch = 0][
  , 
  .(step_sum = sum(steps)), 
  by = .(id, meal, date)]

The results should be the same as above.

Edit 2: You can merge both the second and third outcomes (average glucose and sum of steps). Make sure both have id, phase, meal and date to merge on. The first dt1 now includes the mean glucose and stores the associated meal. Store both dt1 and dt2 in intermediate data.tables:

dt1 <- sampleData[!is.na(gl)][
  mealsData,
  .(id, phase, gl, x.time, meal, date),
  on = .(id, phase, time >= postprandial_1, time <= postprandial_2),
  nomatch = 0][
    ,
    .(gl_ave = mean(gl)), 
    by = .(id, phase, meal, date)]

dt2 <- sampleData[mealsData, 
           .(id, phase, meal, date, steps), 
           on = .(id, phase, time >= time, time <= postprandial_2), 
           nomatch = 0][
  , 
  .(step_sum = sum(steps)), 
  by = .(id, phase, meal, date)]

Then merge:

merge(dt1, dt2, by = c("id", "phase", "meal", "date"))

Hello @Ben, thank you for being so descriptive. I've been able to recreate your first fuzzy_inner_join with good results. Unfortunately, the runtime of the second fuzzy_inner_join is too long. I've stopped after 3 hours. My sampleData has ~650,000 rows, and mealsData generates ~3,000 rows. What would be a workaround for this? — TIV, Aug 22 '21 at 12:38
The `data.table` and `sqldf` solutions likely would be faster (I alluded to those early on in the answer for this reason). Would you like me to edit the answer with one of these options? Happy to. — Ben, Aug 22 '21 at 13:12
Yes please, it would help a lot if you could provide the code to the data.table and sqldf alternative(s). — TIV, Aug 22 '21 at 13:13
@TIV Please see bottom of answer above with `data.table` option. Let me know if this helps and if you have questions. — Ben, Aug 22 '21 at 13:44
@TIV Also, you may want to subset your data and try on 1,000 rows or so, just to check its performance and accuracy, before trying on full dataset. — Ben, Aug 22 '21 at 13:51
thank you. Both `data.table` commands run well and generate results. The sum of steps results are quite similar to the results of @jamoreiras, in the answer below. How would I go about merging the tables from 2) and 3)? Essentially, for every meal I would like a row that includes the identifying info, glucose at meal, mean glucose 2h+-15min after meal, and sum of steps. Is there a way to get only a mean reading of glucose for the 2h+-15min timeframe in both 1) and 2)? That would make merging easier. — TIV, Aug 22 '21 at 15:49
running the dplyr script for postprandial reaction on a subset of 1,000 rows took 45 seconds, i.e. would take ~8 hours for the whole dataset. — TIV, Aug 22 '21 at 15:53
Yes, you can merge the two with `merge`. Make sure both have `id`, `phase`, `meal` and `date` to merge on. It does not have glucose at a given meal, since that was `NA` in your dataset, but that could be added as well when making the `mealsData` (and just carrying that value through). See edited answer (Edit 2). — Ben, Aug 22 '21 at 16:11

score 0 · Answer 2 · answered Aug 21 '21 at 19:38

0

Since your the data frame sampleData is sorted and includes one observation per minute this can be used in your advantage:

library(dplyr)
library(zoo)

1) Retrieve the rows with the first meal of the day of each participant (id), and their glucose reading 2 hours (+-15 minutes) before that meal:

sampleData$gl <- na.locf(sampleData$gl, na.rm=FALSE)

df1 <- sampleData %>% 
  mutate(previousGl = lag(gl,120), glTime = lag(time, 120)) %>%  
  filter(!is.na(kiloCalories))

2) Retrieve the rows with each meal (i.e. each kiloCalories entry) of each participant (id), along with the glucose reading 2 hours (+-15 minutes) after the meal.

sampleData$gl <- na.locf(sampleData$gl, fromLast = TRUE,na.rm=FALSE)

df2 <- sampleData %>% 
  mutate(previousGl = lag(gl,120), glTime = lead(time, 120)) %>%  
  filter(!is.na(kiloCalories))

3) From task 2, take the subset of data in between meal and glucose reading, and calculate the sum of steps within that time.

lapply(1:NROW(df2), function(i) {
  sampleData %>% filter(time >= df2$time[i], 
                        time <= df2$glTime[i]) %>%
    summarize(steps = sum(steps))
})

answered Aug 21 '21 at 19:38

jamoreiras

315
1
14

Hello @jamoreiras, thank you for your help. If I run your suggestions for 1) and 2) in sequence, they generate the same dataframes. If I 'reset' sampleData$gl to it's initial state before running 2), I get two differing data frames. These dataframes (df1 & df2) differ both in the gl column, and the previousGl column. I understand the difference in the previousGl column, but how can they have different gl readings for the same time? Can you help me interpret this? – TIV Aug 22 '21 at 13:12
First, you are correct, you are supposed to reset the dataframe between runs 1) and 2). – jamoreiras Aug 22 '21 at 15:00
Second, there are readings of glucose every 15 minutes. If the time of the meals is random and considering the margin of +/- 15 minutes you gave, than if we are looking for the glucose reading **after** the meal, we need to take the glucose reading from the next reading. Depending on the exact time of the meal, adding 2 hours, will get us always a reading 2 hours + 15 minutes if the meal was not on the exact hour, in which case is a reading 2 hours after the meal exactly. – jamoreiras Aug 22 '21 at 15:09
This approach takes advantage we are dealing with an sorted dataframe with readings every minute. If that were not the case, the whole readings would be inaccurate. Still, df2 retrieves 2 lines for which the second line doesn’t get a reading even though there is a reading 15 minutes before we get to the 2 hours after the meal. – jamoreiras Aug 22 '21 at 15:16

How do I filter a first event per day per group, select a variable in another column based on a condition, and run calculations on values inbetween?

2 Answers2