1

I need to detect a sequence by group in a data.frame and compute new variable.

Consider I have this following data.frame:

df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
              seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
              count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
              product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
              stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

df1

> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D

I am interested to compute a measure for ID that follow this sequence:

  - Count == 1
  - Count > 1
  - Count == 1

In the example this is true for:

 - rows 2, 3, 4 for `ID==1`
 - rows 8, 9, 10 for `ID==2`
 - rows 12, 13, 14 for `ID==3`

For these ID and rows, I need to compute a measure called new that takes the value of the product of the last row of the sequence if it is in the second row of the sequence and NOT in the stock of the first sequence.

The desired outcome is shown below:

> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3    
3  3    2    3    4   D

Note:

  1. In the sequence detected for ID no new products are added to the stock.
  2. In the original data there are a lot of IDs who do not have any sequences.
  3. Some ID have multiple qualifying sequences. All should be recorded.
  4. Count is always 1 or greater.
  5. The original data holds millions of ID with up to 1500 sequences.

How would you write an efficient piece of code to get this output?

M--
  • 25,431
  • 8
  • 61
  • 93
wake_wake
  • 1,332
  • 2
  • 19
  • 46

2 Answers2

1

Here's a data.table option:

library(data.table)

char_cols <- c("product", "stock")
setDT(df1)[, 
           (char_cols) := lapply(.SD, as.character), 
           .SDcols = char_cols] # in case they're factors
df1[, c1 := (count == 1) & 
            (shift(count) > 1) & 
            (shift(count, 2L) == 1), 
     by = ID] #condition1
df1[, pat := paste0("(", gsub(",", "|", product), ")")] # pattern
df1[, c2 := mapply(grepl, pat, shift(product)) & 
            !mapply(grepl, pat, shift(stock, 2L)), 
    by = ID] # condition2
df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
josemz
  • 1,283
  • 7
  • 15
  • `setDT` converts `data.frame` to `data.table`. You can convert back using `setDF`. – M-- Apr 12 '19 at 21:49
1

Here's another approach using ; however, I think lag and lead has made this solution a bit time-consuming. I included the comments within the code to make it more legible.

But I spent enough time on it, to post it anyway.

library(tidyverse)

df1 %>% group_by(ID) %>%  

 # this finds the row with count > 1 which ...
 #... the counts of the row before and the one of after it equals to 1
 mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>% 

 # this makes a column which has value of True for each chunk...      
 #that meets desired condition to later filter based on it
 mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F))  %>% 

 filter(test2) %>% ungroup() %>% 

 # group each three occurrences in case of having multiple ones within each ID
 group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>% 

 # creating new column with string extracting techniques ...
 #... (assuming those columns are characters) 
 mutate(new=
 str_remove_all(
    as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
               stock[1])) %>% 

  # selecting desired columns and adding times for long to wide conversion
  select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>% 

  # long to wide conversion using tidyr (part of tidyverse)
  gather(key, value, -ID, -G, -new, -times) %>%
  unite(col, key, times) %>% spread(col, value) %>% 

  # making the desired order of columns
  select(-G,-new,new) %>% as.data.frame()

#   ID seqs_1 seqs_2 seqs_3 new
# 1  1      2      3      4   C
# 2  2      1      2      3    
# 3  3      2      3      4   D
M--
  • 25,431
  • 8
  • 61
  • 93