0

Okay, I have a pretty heavy lifting problem for a loop in R that is taxing my knowledge. Any advice or direction is appreciated.

I have a list of dates and phone numbers. The phone number is being used as the index and can appear for multiple dates.

    library(dplyr)

    Date = as.Date(c("2019-03-01", "2019-03-15","2019-03-29",      "2019-04-10","2019-03-05","2019-03-20"))
    Phone = c("070000001","070000001","070000001","070000001","070000002","070000002")
    df<-data.frame(Date,Phone)

    df

    ##         Date     Phone
    ## 1 2019-03-01 070000001
    ## 2 2019-03-15 070000001
    ## 3 2019-03-29 070000001
    ## 4 2019-04-10 070000001
    ## 5 2019-03-05 070000002
## 6 2019-03-20 070000002

I then computed the difference between dates for each interaction

    df<-df %>%
      arrange(Phone,Date) %>%
      group_by(Phone) %>%
      mutate(Diff = Date - lag(Date))

    df

    ##   Date       Phone     Diff   
    ##   <date>     <fct>     <drtn> 
    ## 1 2019-03-01 070000001 NA days
    ## 2 2019-03-15 070000001 14 days
    ## 3 2019-03-29 070000001 14 days
    ## 4 2019-04-10 070000001 12 days
    ## 5 2019-03-05 070000002 NA days
    ## 6 2019-03-20 070000002 15 days

Now comes the hard part. I am trying to create a dynamic column range that counts the number of rows from the earliest date/phone pair until the sum of the data difference is =< 30. Then a subsequent column is created for the next incident. Once a new number begins the columns start back at 1. So it becomes an record on incidents in which multiple rows may have membership in each instance.

I have tried several iterations of for and while loop combinations but they were so terrible that I just deleted them. I also tried this as a matrix function but the logic was far to complicated for the query and even when going over a small sample set my computer wouldn't like it.

This is as far as I got.

current.sum <- 0
for (c in 1:nrow(df)) {
  current.sum <- current.sum + df[c, "Diff"]
  df[c, "Diff"] <- current.sum
  if (current.sum <= 30) {
    df[c, "Int"] <- nrow(current.sum)
    current.sum <- NA()
  }
}

The desired dataset would look like this.There are essentially five steps:

  • For each phone number
  • Sum Diff until it reaches 30
  • Count the number of rows
  • Create a column and place the row count next to the included rows
  • Restart (either for the next phone number or for a subsequent sum up to 30)
    ##         Date     Phone  Diff Int_1 Int_2
    ## 1 2019-03-01 070000001    NA     3    NA
    ## 2 2019-03-15 070000001    14     3    NA
    ## 3 2019-03-29 070000001    14     3     2
    ## 4 2019-04-10 070000001    12    NA     2
    ## 5 2019-03-05 070000002    NA     1    NA
    ## 6 2019-03-20 070000002    15     1    NA

Update 18/07/2019

I have been able to get 70% of the way, but am still missing the transposed columns and dual membership of date/phone combinations. I might be able to do with the transpose I can add this the solution below a way to evaluate the last group date to see if it should be included in membership rather than being exclusive to the 30 days threshold.

     cumsum_group <- function(x, threshold) {
      cumsum <- 0
      group <- 1
      result <- numeric()

      for (i in 1:length(x)) {
        cumsum <- cumsum + x[i]

        if (cumsum > threshold) {
          group <- group + 1
          cumsum <- x[i]
        }

        result = c(result, group)

      }

      return (result)
    }

    df<-df %>% group_by(binning=cumsum_group(Diff, 30))

The issue is that line 3 and 4 both belong to instance 2 of 070000001 the previous entry when added together equals 26. So the function needs to both lag to the previous value once the threshold is reached AND have a new column that allows that to be flagged.

  • please check your code. Right now the first line after while only contains `n <- ` – Cettt Jul 17 '19 at 14:45
  • Thanks Cettt, I am still trying to build it. This is where I am stuck. – Jacob Baisley Jul 18 '19 at 10:57
  • then please explain your desired output. Why is Int_1 `NA` at row 4 and 1 at row 5? – Cettt Jul 18 '19 at 11:54
  • Because the phone number has changed so the instance for that phone number goes back to the first column, or the first instance. If they do not start repeating from the first column we would end up with a series of columns of x length for every cumulative sum of 30 regardless of number. It would make it difficult to see how many repeat instances there are by each phone. I am open to changing the parameters to a continuous list of columns, we would just need a way to tackle the complexity of a dynamic number of columns that need to be reduced to instances for each phone number. – Jacob Baisley Jul 18 '19 at 12:06
  • What is getting me is the need to represent instances that might have membership in two columns such as line 3. It represents both the end and start of a new series. so needs to be counted twice. – Jacob Baisley Jul 18 '19 at 12:07
  • Not clear to me. Why there are two new columns in the output? What do they represent? Is it because you have got two `unique` phone numbers ? – Ronak Shah Jul 19 '19 at 06:51
  • Hello Ronak, it is not about phone number only, it is phone numbers and instances of 30 days. If you are summing instances of 30 days line `1` (NA) is the origin, so the days start counting from that day. Lines `2` and `3` summed together are 28, so lines `1,2,3` are part of the first instance of phone 1. When moving to instance 2 we need to start counting from line `3`. Line `3` and `4` sum to 26 and are the 2nd instance. But now line `3` is part of both the first instance of counting as well as the second and need to be represented in both instances. Instances are not mutually exclusive. – Jacob Baisley Jul 19 '19 at 08:07

0 Answers0