0

I need to create a running sequential number for every row in a group. The groups are both the artist IDs and also the course number. The course number is also a sequential running number created based off of a very specific criteria: If an artist goes more than 7 days in between recording, a new course number is created.

For example, let's say that we have an artist_id whose data looks like this:

artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel year
1 1 3 3 1/1/2000 Left 2000
1 2 3 3 1/3/2000 Right 2000
1 3 3 3 1/8/2000 Right 2000
1 4 3 3 5/1/2000 Left 2000

This artist_id came in for a session on 1/1/2000, 1/3/2000, 1/8/2000, and 5/1/2000. Based off of the aforementioned criteria for creating the course number groups (no more than 7 days in between recording dates) while also adding in the running count of the number of sessions for that artist, the final dataset should look like this:

artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel year days_between Status course_number session_in_course
1 1 3 3 1/1/2000 Left 2000 0 Existing Course 1 1
1 2 3 3 1/3/2000 Right 2000 2 Existing Course 1 2
1 3 3 3 1/8/2000 Right 2000 5 Existing Course 1 3
1 4 3 3 5/1/2000 Left 2000 114 New Course 2 1

I am able to achieve this using some very convoluted code with dplyr. It works every time but the problem is that with 2.5 million rows in my dataset, the code can take 20-30 minutes to run each time I open a new session in R.

Considering how great data.table is for large sets of data, I'm wondering if anyone knows a data.table solution for creating the sequential group numbers and the running session count inside each of the groups based off of the criteria mentioned above? Any help would be appreciated so thank you in advance.

Here is a reproducible dataset and the code I used in dplyr to accomplish creating the final dataset:

library(tidyverse)
library(scales)
library(tibbletime)
library(lubridate)
library(data.table)

x <- structure(list(artist_id = c(257L, 257L, 257L, 257L, 257L, 257L, 
                             257L, 257L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 
                             421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L), session_number_total = c(1L, 
                                                                                                             2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
                                                                                                             10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), CustomerRecordId = c(4L, 
                                                                                                                                                                                4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
                                                                                                                                                                                5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), SiteRecordId = c(5L, 5L, 
                                                                                                                                                                                                                                      5L, 5L, 5L, 5L, 5L, 5L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
                                                                                                                                                                                                                                      10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), recording_date = structure(c(16062, 
                                                                                                                                                                                                                                                                                                                      16063, 16065, 16066, 16074, 16079, 16087, 16092, 16027, 16028, 
                                                                                                                                                                                                                                                                                                                      16035, 16038, 16056, 16058, 16065, 16072, 16073, 16077, 16079, 
                                                                                                                                                                                                                                                                                                                      16083, 16086, 16087, 16090, 16091, 16094, 16111), class = "Date"), 
               control_panel = c("Left", "Left", "Left", "Left", "Left", 
                                 "Left", "Left", "Left", "Bilateral", "Bilateral", "Bilateral", 
                                 "Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral", 
                                 "Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral", 
                                 "Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral"
               ), year = c(2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 
                           2014L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
                           2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
                           2014L)), class = "data.frame", row.names = c(NA, -26L))

x

final_df <- x %>%
  arrange(artist_id, recording_date) %>%
  group_by(artist_id) %>%
  mutate(days_between = recording_date - lag(recording_date, 1)) %>%
  arrange(artist_id, recording_date) %>%
  mutate(days_between = ifelse(is.na(days_between), 0, days_between)) %>%
  mutate(Status = ifelse(days_between > 7, "New Course", "Existing Course")) %>%
  mutate(num = ifelse(Status == "New Course", seq(1, 100000), 1)) %>%
  group_by(artist_id, Status, num) %>%
  mutate(group_num = row_number()) %>%
  ungroup() %>%
  group_by(artist_id, group_num) %>%
  mutate(course_number = ifelse(group_num == 1, seq(1, 100000),NA)) %>%
  ungroup() %>%
  group_by(artist_id) %>%
  fill(course_number) %>%
  ungroup() %>%
  group_by(artist_id, course_number) %>%
  mutate(session_in_course =row_number()) %>%
  select(-num, -group_num)

final_df <- as.data.frame(final_df)

final_df
  • 1
    I believe this question might be what you're looking for - https://stackoverflow.com/questions/12925063/numbering-rows-within-groups-in-a-data-frame – Gabe Solomon May 07 '21 at 15:44
  • I saw that question earlier but it does not include any criteria for when a new group should be created, which is central to the question I asked. If it was just a matter of using the data.table equivalent to row_number() in dplyr then it would be an easy task. I already incorporate row_number() in multiple lines in my dplyr solution code above, and you can see all of the other data wrangling necessary to get the final output that is desired. I'm wondering if there is a way to do all of these steps in data.table. – user15863797 May 07 '21 at 15:55

1 Answers1

1

How about this data.table solution:

library(data.table)
setDT(x)
x[, days_between := c(0, diff(recording_date)), by = .(artist_id)
  ][, course_number := 1L + cumsum(days_between > 7), by = .(artist_id)
  ][, session_in_course := seq_len(.N), by = .(artist_id, course_number)]
#     artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel  year days_between course_number session_in_course
#         <int>                <int>            <int>        <int>         <Date>        <char> <int>        <num>         <int>             <int>
#  1:       257                    1                4            5     2013-12-23          Left  2013            0             1                 1
#  2:       257                    2                4            5     2013-12-24          Left  2013            1             1                 2
#  3:       257                    3                4            5     2013-12-26          Left  2013            2             1                 3
#  4:       257                    4                4            5     2013-12-27          Left  2013            1             1                 4
#  5:       257                    5                4            5     2014-01-04          Left  2014            8             2                 1
#  6:       257                    6                4            5     2014-01-09          Left  2014            5             2                 2
#  7:       257                    7                4            5     2014-01-17          Left  2014            8             3                 1
#  8:       257                    8                4            5     2014-01-22          Left  2014            5             3                 2
#  9:       421                    1                5           10     2013-11-18     Bilateral  2013            0             1                 1
# 10:       421                    2                5           10     2013-11-19     Bilateral  2013            1             1                 2
# 11:       421                    3                5           10     2013-11-26     Bilateral  2013            7             1                 3
# 12:       421                    4                5           10     2013-11-29     Bilateral  2013            3             1                 4
# 13:       421                    5                5           10     2013-12-17     Bilateral  2013           18             2                 1
# 14:       421                    6                5           10     2013-12-19     Bilateral  2013            2             2                 2
# 15:       421                    7                5           10     2013-12-26     Bilateral  2013            7             2                 3
# 16:       421                    8                5           10     2014-01-02     Bilateral  2014            7             2                 4
# 17:       421                    9                5           10     2014-01-03     Bilateral  2014            1             2                 5
# 18:       421                   10                5           10     2014-01-07     Bilateral  2014            4             2                 6
# 19:       421                   11                5           10     2014-01-09     Bilateral  2014            2             2                 7
# 20:       421                   12                5           10     2014-01-13     Bilateral  2014            4             2                 8
# 21:       421                   13                5           10     2014-01-16     Bilateral  2014            3             2                 9
# 22:       421                   14                5           10     2014-01-17     Bilateral  2014            1             2                10
# 23:       421                   15                5           10     2014-01-20     Bilateral  2014            3             2                11
# 24:       421                   16                5           10     2014-01-21     Bilateral  2014            1             2                12
# 25:       421                   17                5           10     2014-01-24     Bilateral  2014            3             2                13
# 26:       421                   18                5           10     2014-02-10     Bilateral  2014           17             3                 1
#     artist_id session_number_total CustomerRecordId SiteRecordId recording_date control_panel  year days_between course_number session_in_course
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Your solution (as well as the solution from @Waldi) works perfectly. This is truly amazing- I was aware of data.table's better performance over dplyr with large sets of data, but this is more than I could have expected. Like I mentioned in the original post, my original dplyr code took about 20-25 minutes to run when using my actual dataset of 2.5 million rows. Your data.table code (as well as Waldi's) takes approximately 5 seconds to run for those same 2.5 million rows. You two have officially converted me to the dark (data.table) side- I have some learning to do now. Thank you both so much! – user15863797 May 09 '21 at 23:17