I need to create a running sequential number for every row in a group. The groups are both the artist IDs and also the course number. The course number is also a sequential running number created based off of a very specific criteria: If an artist goes more than 7 days in between recording, a new course number is created.
For example, let's say that we have an artist_id whose data looks like this:
artist_id | session_number_total | CustomerRecordId | SiteRecordId | recording_date | control_panel | year |
---|---|---|---|---|---|---|
1 | 1 | 3 | 3 | 1/1/2000 | Left | 2000 |
1 | 2 | 3 | 3 | 1/3/2000 | Right | 2000 |
1 | 3 | 3 | 3 | 1/8/2000 | Right | 2000 |
1 | 4 | 3 | 3 | 5/1/2000 | Left | 2000 |
This artist_id came in for a session on 1/1/2000, 1/3/2000, 1/8/2000, and 5/1/2000. Based off of the aforementioned criteria for creating the course number groups (no more than 7 days in between recording dates) while also adding in the running count of the number of sessions for that artist, the final dataset should look like this:
artist_id | session_number_total | CustomerRecordId | SiteRecordId | recording_date | control_panel | year | days_between | Status | course_number | session_in_course |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 3 | 3 | 1/1/2000 | Left | 2000 | 0 | Existing Course | 1 | 1 |
1 | 2 | 3 | 3 | 1/3/2000 | Right | 2000 | 2 | Existing Course | 1 | 2 |
1 | 3 | 3 | 3 | 1/8/2000 | Right | 2000 | 5 | Existing Course | 1 | 3 |
1 | 4 | 3 | 3 | 5/1/2000 | Left | 2000 | 114 | New Course | 2 | 1 |
I am able to achieve this using some very convoluted code with dplyr. It works every time but the problem is that with 2.5 million rows in my dataset, the code can take 20-30 minutes to run each time I open a new session in R.
Considering how great data.table is for large sets of data, I'm wondering if anyone knows a data.table solution for creating the sequential group numbers and the running session count inside each of the groups based off of the criteria mentioned above? Any help would be appreciated so thank you in advance.
Here is a reproducible dataset and the code I used in dplyr to accomplish creating the final dataset:
library(tidyverse)
library(scales)
library(tibbletime)
library(lubridate)
library(data.table)
x <- structure(list(artist_id = c(257L, 257L, 257L, 257L, 257L, 257L,
257L, 257L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L,
421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L, 421L), session_number_total = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L), CustomerRecordId = c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), SiteRecordId = c(5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), recording_date = structure(c(16062,
16063, 16065, 16066, 16074, 16079, 16087, 16092, 16027, 16028,
16035, 16038, 16056, 16058, 16065, 16072, 16073, 16077, 16079,
16083, 16086, 16087, 16090, 16091, 16094, 16111), class = "Date"),
control_panel = c("Left", "Left", "Left", "Left", "Left",
"Left", "Left", "Left", "Bilateral", "Bilateral", "Bilateral",
"Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral",
"Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral",
"Bilateral", "Bilateral", "Bilateral", "Bilateral", "Bilateral"
), year = c(2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L,
2014L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L)), class = "data.frame", row.names = c(NA, -26L))
x
final_df <- x %>%
arrange(artist_id, recording_date) %>%
group_by(artist_id) %>%
mutate(days_between = recording_date - lag(recording_date, 1)) %>%
arrange(artist_id, recording_date) %>%
mutate(days_between = ifelse(is.na(days_between), 0, days_between)) %>%
mutate(Status = ifelse(days_between > 7, "New Course", "Existing Course")) %>%
mutate(num = ifelse(Status == "New Course", seq(1, 100000), 1)) %>%
group_by(artist_id, Status, num) %>%
mutate(group_num = row_number()) %>%
ungroup() %>%
group_by(artist_id, group_num) %>%
mutate(course_number = ifelse(group_num == 1, seq(1, 100000),NA)) %>%
ungroup() %>%
group_by(artist_id) %>%
fill(course_number) %>%
ungroup() %>%
group_by(artist_id, course_number) %>%
mutate(session_in_course =row_number()) %>%
select(-num, -group_num)
final_df <- as.data.frame(final_df)
final_df