R - building new variables from sequenced data

Question

This is an update / follow-up on this question. The answer outlined their doesn't meet the new requirements.

I am looking for an efficient way (data.table?) to construct two new measures for each ID.

Measure 1 and Measure 2 needs to meet the following conditions:

Condition 1: Find a sequence of three rows for which:

the first count > 0
the second `count >1' and
the third count ==1.

Condition 2 for Measure 1:

takes the value of the elements in product of the third row of the sequence that are:
in the product of second row of sequence and
NOT in the stock of the first row in sequence.

Condition 2 for measure 2:

takes the value of the elements in product of the last row of the sequence that are:
NOT in the product of second row of sequence
NOT in the stock of the first row in sequence.

Data:

df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
              seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
              count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
              product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
              stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

> df2
   ID seqs count product     stock
1   1    1     2       A         A
2   1    2     1       B       A,B
3   1    3     3       C     A,B,C
4   1    4     1   A,C,E   A,B,C,E
5   1    5     1     A,B   A,B,C,E
6   1    6     2   A,B,C   A,B,C,E
7   1    7     3       D A,B,C,D,E
8   2    1     1       A         A
9   2    2     2       B       A,B
10  2    3     1       A       A,B
11  3    1     3       A         A
12  3    2     1   A,B,C     A,B,C
13  3    3     4       D   A,B,C,D
14  3    4     1       D   A,B,C,D

The desired output looks like this:

   ID seq1 seq2 seq3 measure1   measure2
1:  1    2    3    4   C         E 
2:  2    1    2    3    
3:  3    2    3    4   D

How would you code this?

score 1 · Accepted Answer · answered Aug 16 '19 at 19:12

Few things you need to know to be able to do this:

shift function to compare values in your groups
separate_rows function to split your strings to get to the normalised data view.

library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

dt[, count.2 := shift(count, type = "lead")]
dt[, count.3 := shift(count, n = 2, type = "lead")]

dt[, product.2 := shift(product, type = "lead")]
dt[, product.3 := shift(product, n = 2, type = "lead")]


dt <- dt[count > 0 & count.2 > 1 &  count.3 == 1]
dt <- unique(dt, by = "ID")

library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")

dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[, 
  .(
    measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE), 
    measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
  ),
  ID
]

dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1:  1         C         E
# 2:  2      <NA>      <NA>
# 3:  3         D      <NA>

This approach works well for the mock data. But in the real data (500,000+ rows), the product and stock columns are filled with hundreds of items. As a result, the `dt` gets extremely large when we `separate_rows` and my local machine (128GB) runs out of memory. Is there an alternative for `separate_rows`? Perhaps a `regex` solution? — wake_wake, Aug 20 '19 at 15:20
Well, there are few things that you can do: 1. partition the data and solve problem in chunks (aka make it small) 2. put this data into a database like BigQuery (aka delegate) and use SQL. 3. Check some other packages, e.g. answer here https://stackoverflow.com/questions/34712949/split-column-in-data-table-to-multiple-rows — Bulat, Aug 21 '19 at 18:46
Also see this for D3 method: https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows — Bulat, Aug 21 '19 at 18:48

score 1 · Answer 2 · answered Aug 16 '19 at 20:24

I'm not sure what the criteria for efficient is, but here's an approach using embed and tidyverse style. It filters down so you are working with less and less.

Loading up the data and packages (note later on setdiff and intersect are from dplry)

library(purrr)
library(dplyr)

df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "A,C,E", "A,B", 
                              "A,B,C", "D", "A", "B", "A", "A", 
                              "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", 
                            "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", 
                            "A,B,C", "A,B,C,D", "A,B,C,D"),
                  stringsAsFactors = FALSE)

Define a helper function to evaluate condition 1

meetsCond1 <- function(rseg) { 

  seg <- rev(rseg)

  all(seg[1] > 0, seg[2] > 1, seg[3] == 1)

}

The embed function warps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.

cond1Match<- embed(df1$count, 3) %>%
  apply(1, meetsCond1) %>%
  which()

You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.

finalProds <- df1$product[cond1Match + 2] %>%
  strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>%
  strsplit(",")
initialStock <- df1$stock[cond1Match] %>%
  strsplit(",")

For both measures, neither of them can be in the stock.

notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])

Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.

data.frame(ID = df1$ID[cond1Match],
           seq1 = df1$seqs[cond1Match], 
           seq2 = df1$seqs[cond1Match + 1],
           seq3 = df1$seqs[cond1Match + 2],
           measure1 = imap_chr(notStock, 
                               ~intersect(.x, prevProds[[.y]]) %>%
                               {if(length(.) == 0) "" else paste(., sep = ",")}

           ),
           measure2 = imap_chr(notStock, 
                               ~setdiff(.x, prevProds[[.y]]) %>%
                               {if(length(.) == 0) "" else paste(., sep = ",")}

           ),
           stringsAsFactors = FALSE
) %>%
  slice(match(unique(ID), ID))

which yields the desired output, which seems to limit at most one line per ID. In the original post, you specify you want all reported. Removing the slice call would then instead yield

#>   ID seq1 seq2 seq3 measure1 measure2
#> 1  1    2    3    4        C        E
#> 2  1    6    7    1                  
#> 3  2    1    2    3                  
#> 4  2    3    1    2                 C
#> 5  3    2    3    4        D

If you're looking to really squeeze efficiency, you might be able to gain some by placing the definitions of finalProds, prevProds, and initialStock instead of assigning them to variables first. I would imagine unless your set of matches is really large, it would be negligible.

score 1 · Answer 3 · answered Aug 16 '19 at 23:30

A rolling window approach using data.table with base R code in j:

library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]

ans <- df2[, 
    transpose(lapply(1L:(.N-2L), function(k) {
        if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
            m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
            m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
            c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
                measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
                measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
        }
    }), ignore.empty=TRUE),
    ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans

output:

   ID seq1 seq2 seq3 measure1 measure2
1:  1    2    3    4        C        E
2:  2    1    2    3                  
3:  3    2    3    4        D

This is interesting. Is the rolling window efficient? That is, does it scale well to data with 500.000+ rows? — wake_wake, Aug 19 '19 at 20:07
I think rolling window is required for your problem. There might be more efficient method than this. You might want to try on your actual dataset. — chinsoon12, Aug 19 '19 at 22:08

R - building new variables from sequenced data

3 Answers3

Linked