Okay, I have a pretty heavy lifting problem for a loop in R that is taxing my knowledge. Any advice or direction is appreciated.
I have a list of dates and phone numbers. The phone number is being used as the index and can appear for multiple dates.
library(dplyr)
Date = as.Date(c("2019-03-01", "2019-03-15","2019-03-29", "2019-04-10","2019-03-05","2019-03-20"))
Phone = c("070000001","070000001","070000001","070000001","070000002","070000002")
df<-data.frame(Date,Phone)
df
## Date Phone
## 1 2019-03-01 070000001
## 2 2019-03-15 070000001
## 3 2019-03-29 070000001
## 4 2019-04-10 070000001
## 5 2019-03-05 070000002
## 6 2019-03-20 070000002
I then computed the difference between dates for each interaction
df<-df %>%
arrange(Phone,Date) %>%
group_by(Phone) %>%
mutate(Diff = Date - lag(Date))
df
## Date Phone Diff
## <date> <fct> <drtn>
## 1 2019-03-01 070000001 NA days
## 2 2019-03-15 070000001 14 days
## 3 2019-03-29 070000001 14 days
## 4 2019-04-10 070000001 12 days
## 5 2019-03-05 070000002 NA days
## 6 2019-03-20 070000002 15 days
Now comes the hard part. I am trying to create a dynamic column range that counts the number of rows from the earliest date/phone pair until the sum of the data difference is =< 30. Then a subsequent column is created for the next incident. Once a new number begins the columns start back at 1. So it becomes an record on incidents in which multiple rows may have membership in each instance.
I have tried several iterations of for and while loop combinations but they were so terrible that I just deleted them. I also tried this as a matrix function but the logic was far to complicated for the query and even when going over a small sample set my computer wouldn't like it.
This is as far as I got.
current.sum <- 0
for (c in 1:nrow(df)) {
current.sum <- current.sum + df[c, "Diff"]
df[c, "Diff"] <- current.sum
if (current.sum <= 30) {
df[c, "Int"] <- nrow(current.sum)
current.sum <- NA()
}
}
The desired dataset would look like this.There are essentially five steps:
- For each phone number
- Sum Diff until it reaches 30
- Count the number of rows
- Create a column and place the row count next to the included rows
- Restart (either for the next phone number or for a subsequent sum up to 30)
## Date Phone Diff Int_1 Int_2
## 1 2019-03-01 070000001 NA 3 NA
## 2 2019-03-15 070000001 14 3 NA
## 3 2019-03-29 070000001 14 3 2
## 4 2019-04-10 070000001 12 NA 2
## 5 2019-03-05 070000002 NA 1 NA
## 6 2019-03-20 070000002 15 1 NA
Update 18/07/2019
I have been able to get 70% of the way, but am still missing the transposed columns and dual membership of date/phone combinations. I might be able to do with the transpose I can add this the solution below a way to evaluate the last group date to see if it should be included in membership rather than being exclusive to the 30 days threshold.
cumsum_group <- function(x, threshold) {
cumsum <- 0
group <- 1
result <- numeric()
for (i in 1:length(x)) {
cumsum <- cumsum + x[i]
if (cumsum > threshold) {
group <- group + 1
cumsum <- x[i]
}
result = c(result, group)
}
return (result)
}
df<-df %>% group_by(binning=cumsum_group(Diff, 30))
The issue is that line 3 and 4 both belong to instance 2 of 070000001 the previous entry when added together equals 26. So the function needs to both lag to the previous value once the threshold is reached AND have a new column that allows that to be flagged.