0

I have performed counts of events (in Group 1) over a time period for each group (in Group 2). I am looking to spread Group 1 events into separate columns and use Group 2 and timestamp as rows. Each cell will contain the counts of events over a time period (Present date to the previous 4 days).

See the example below, for each of Group 2 (I & II), I counted Events A and L in Group 1 that happened within 4 days.

dates = as.Date(c("2011-10-09",
   "2011-10-15",
   "2011-10-16", 
   "2011-10-18", 
   "2011-10-21", 
   "2011-10-22", 
   "2011-10-24")) 
group1=c("A",
    "A",
    "A", 
    "A", 
    "L", 
    "L", 
    "A")
group2=c("I",
    "I",
    "I", 
    "I", 
    "I", 
    "I", 
    "II")

df1 <- data.frame(dates, group1, group2) 

Using dplyr pipes I managed to produce the following table (also see Count event types over time series by multiple conditions)

df1 %>%
  group_by(group1, group2) %>%
  mutate(count = sapply(dates
                    , function(x){
                      sum(dates <= x & dates > (x-4))
                      }))


   dates group1 group2 count
  <date> <fctr> <fctr> <int>
1 2011-10-09      A      I     1
2 2011-10-15      A      I     1
3 2011-10-16      A      I     2
4 2011-10-18      A      I     3
5 2011-10-21      L      I     1
6 2011-10-22      L      I     2
7 2011-10-24      A     II     1

Eventually, I want to obtain a table similar to this, with Events A & L counts updated according to dates (time period = current date - 4 days) in both I & II (Group 2).

         dates  group1 group2  count (A)   count (L)
     1 2011-10-09      A      I        1         0
     2 2011-10-15      A      I        1         0
     3 2011-10-16      A      I        2         0
     4 2011-10-18      A      I        3         0
     5 2011-10-21      L      I        0         1
     6 2011-10-22      L      I        0         2
     7 2011-10-24      A      II       1         0

In a larger dataset, not all events in Group 1 appears in Group 2. How can I update these empty cells so that it will either 1) carry forward the count from the previous row or 2) update the count based on the updated timestamp/ time period?

Thanks!

rakeshboliya
  • 103
  • 12
byc
  • 121
  • 10
  • In your example, you show `0` for `count (A)` in rows 5 and 6. However, there was one event (on the 18th) that was within 4 days of the 21st (row 5), though the count should return to `0` on the 21st. Is your example an accurate depiction of how you want data displayed? Similarly, if there are two entries for a day, do you want a row for each? Is there a reason you want to keep the `group1` and `group2` columns instead of just showing the event counts? Are you still trying to do separate counts for each `group2` subgroup? – Mark Peterson Dec 13 '17 at 14:42
  • A lot of this may be answered by giving us a motivating example. In each of the two previous questions, and especially now, it is unclear what the desired outcome is because it is unclear how it will be used. – Mark Peterson Dec 13 '17 at 14:44

1 Answers1

0

While it is still a bit unclear what you want (see comments on the question), here are two potential approaches.

If all you want to do is spread the count column out (for some reason) and fill it with 0's (whether there was an event in the preceding 4 days or not) and still count by the group2 breakdown (even though you are only labeling by group1) and leave the event details in place (like your example in your question), you can just create a column with the labels you want, then use spread to create the new columns. This

df1 %>%
  group_by(group1, group2) %>%
  mutate(count = sapply(dates
                        , function(x){
                          sum(dates <= x & dates > (x-4))
                        })) %>%
  ungroup() %>%
  mutate(toSpread = paste0("Count (", group1, ")")) %>%
  spread(toSpread, count, fill = 0)

returns this:

       dates group1 group2 `Count (A)` `Count (L)`
*     <date> <fctr> <fctr>       <dbl>       <dbl>
1 2011-10-09      A      I           1           0
2 2011-10-15      A      I           1           0
3 2011-10-16      A      I           2           0
4 2011-10-18      A      I           3           0
5 2011-10-21      L      I           0           1
6 2011-10-22      L      I           0           2
7 2011-10-24      A     II           1           0

Which matches the output you have shown in your question. However, if what you want is a count on any day with an event of how many of each group1's events have occurred, you will need to step back a bit further. For that, you need to generate a new data frame with the dates you want -- with a row for each group. This is easy to get using complete from tidyr. Then, you can check each of those for events that occurred in the preceding four days for that group.

df1 %>%
  select(dates, group1) %>%
  complete(dates, group1) %>%
  mutate(count = sapply(1:n()
                        , function(idx){
                          sum(df1$dates <= dates[idx] &
                                df1$dates > (dates[idx]-4) &
                                df1$group1 == group1[idx])
                        })) %>%
  mutate(group1 = paste0("Count (", group1, ")")) %>%
  spread(group1, count, fill = 0)

returns:

# A tibble: 7 x 3
       dates `Count (A)` `Count (L)`
*     <date>       <dbl>       <dbl>
1 2011-10-09           1           0
2 2011-10-15           1           0
3 2011-10-16           2           0
4 2011-10-18           3           0
5 2011-10-21           1           1
6 2011-10-22           0           2
7 2011-10-24           1           2

Note that, if you want to include days for which there were no events, you can do that by passing the dates you want checked into complete. For example:

df1 %>%
  select(dates, group1) %>%
  complete(dates = full_seq(dates, 1), group1) %>%
  mutate(count = sapply(1:n()
                        , function(idx){
                          sum(df1$dates <= dates[idx] &
                                df1$dates > (dates[idx]-4) &
                                df1$group1 == group1[idx])
                        })) %>%
  mutate(group1 = paste0("Count (", group1, ")")) %>%
  spread(group1, count, fill = 0)

returns:

        dates `Count (A)` `Count (L)`
 *     <date>       <dbl>       <dbl>
 1 2011-10-09           1           0
 2 2011-10-10           1           0
 3 2011-10-11           1           0
 4 2011-10-12           1           0
 5 2011-10-13           0           0
 6 2011-10-14           0           0
 7 2011-10-15           1           0
 8 2011-10-16           2           0
 9 2011-10-17           2           0
10 2011-10-18           3           0
11 2011-10-19           2           0
12 2011-10-20           1           0
13 2011-10-21           1           1
14 2011-10-22           0           2
15 2011-10-23           0           2
16 2011-10-24           1           2

Based on the comments, I think I am finally understanding the goal. First, I would start out by, as above, creating a "long" data frame with the counts for each group1/group2 pair for every date:

fullDateCounts <-
  df1 %>%
  select(dates, group1, group2) %>%
  complete(dates = full_seq(dates, 1), group1, group2) %>%
  mutate(count = sapply(1:n()
                        , function(idx){
                          sum(df1$dates <= dates[idx] &
                                df1$dates > (dates[idx]-4) &
                                df1$group1 == group1[idx] &
                                df1$group2 == group2[idx]
                              )
                        }))

The top of this is:

        dates group1 group2 count
       <date> <fctr> <fctr> <int>
 1 2011-10-09      A      I     1
 2 2011-10-09      A     II     0
 3 2011-10-09      L      I     0
 4 2011-10-09      L     II     0
 5 2011-10-10      A      I     1
 6 2011-10-10      A     II     0
 7 2011-10-10      L      I     0
 8 2011-10-10      L     II     0
 9 2011-10-11      A      I     1
10 2011-10-11      A     II     0
# ... with 54 more rows

From there, if you really need to convert to a wide form, you can either do so with a row for each group2 (or group1, if you switch the column names):

fullDateCounts %>%
  mutate(group1 = paste0("Count (", group1, ")")) %>%
  spread(group1, count, fill = 0)

returns:

        dates group2 `Count (A)` `Count (L)`
 *     <date> <fctr>       <dbl>       <dbl>
 1 2011-10-09      I           1           0
 2 2011-10-09     II           0           0
 3 2011-10-10      I           1           0
 4 2011-10-10     II           0           0
 5 2011-10-11      I           1           0
 6 2011-10-11     II           0           0
 7 2011-10-12      I           1           0
 8 2011-10-12     II           0           0
 9 2011-10-13      I           0           0
10 2011-10-13     II           0           0
# ... with 22 more rows

Or, you can generate a column for each group1/group2 pair:

fullDateCounts %>%
  mutate(toSpread = paste0("Count (", group1, "-", group2, ")")) %>%
  select(-group1, -group2) %>%
  spread(toSpread, count, fill = 0)

returns

        dates `Count (A-I)` `Count (A-II)` `Count (L-I)` `Count (L-II)`
 *     <date>         <dbl>          <dbl>         <dbl>          <dbl>
 1 2011-10-09             1              0             0              0
 2 2011-10-10             1              0             0              0
 3 2011-10-11             1              0             0              0
 4 2011-10-12             1              0             0              0
 5 2011-10-13             0              0             0              0
 6 2011-10-14             0              0             0              0
 7 2011-10-15             1              0             0              0
 8 2011-10-16             2              0             0              0
 9 2011-10-17             2              0             0              0
10 2011-10-18             3              0             0              0
11 2011-10-19             2              0             0              0
12 2011-10-20             1              0             0              0
13 2011-10-21             1              0             1              0
14 2011-10-22             0              0             2              0
15 2011-10-23             0              0             2              0
16 2011-10-24             0              1             2              0
Mark Peterson
  • 9,370
  • 2
  • 25
  • 48
  • Indeed count (A) in rows 5 and 6 should be 1 and 0, where the counts are calculated within every Group 2. Thanks for highlighting this. Putting dates, Group 1 and Group 2 in context. In my dataset, Group 2 refers to participant IDs in the study, and group 1 subcategories refer to events that are recorded in each participant. The date column refers to each of the Event from Group 1 occurred. Not all participants (group2) reported events/ subcategories measured in Group 1. – byc Dec 13 '17 at 23:16
  • By making this table, I am hoping to find out at every Group 1 event date, each participant (group 2 subcategory) reported X number of Event A, Y Number of Event L etc in the preceding 4 days.So yes I am still trying to do separate counts for each group 2 subgroup. Thanks for proposing these solutions! I suppose I need to add select (dates, group1, group2) and group_by (group2) in the dplyr pipe to update the answer? – byc Dec 13 '17 at 23:17
  • I think that the edit I just made addresses your need. In the future, I would suggest using descriptive labels (e.g., event and participant) to make it more clear what you are trying to accomplish. – Mark Peterson Dec 14 '17 at 21:05
  • Will keep that in mind! It works perfectly on this example dataset, though when I applied it to a larger dataset R gave me an error ''vector size is too large exceeding xx GB''. Presumably, my events (group1) dates spread over years and R was working hard to complete the date info. How do you tackle memory issues in completing time series data? – byc Dec 17 '17 at 11:05
  • The easiest way to tackle memory issues is to chunk your data into more digestible chunks. (Or get more memory.) – Mark Peterson Dec 17 '17 at 13:52