I need to detect a sequence by group in a data.frame and compute new variable.
Consider I have this following data.frame
:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
df1
> df1
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 C A,B,C
5 1 5 1 A,B A,B,C
6 1 6 2 A,B,C A,B,C
7 1 7 3 D A,B,C,D
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
I am interested to compute a measure for ID
that follow this sequence:
- Count == 1 - Count > 1 - Count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1` - rows 8, 9, 10 for `ID==2` - rows 12, 13, 14 for `ID==3`
For these ID and rows, I need to compute a measure called new
that takes the value of the product
of the last row of the sequence if
it is in the second row of the sequence and NOT in the stock
of the first sequence.
The desired outcome is shown below:
> output
ID seq1 seq2 seq3 new
1 1 2 3 4 C
2 2 1 2 3
3 3 2 3 4 D
Note:
- In the sequence detected for ID no new products are added to the stock.
- In the original data there are a lot of IDs who do not have any sequences.
- Some
ID
have multiple qualifying sequences. All should be recorded. - Count is always 1 or greater.
- The original data holds millions of
ID
with up to 1500 sequences.
How would you write an efficient piece of code to get this output?