2

I have data that looks like this :

    char_column   date_column1 date_column2 integer_column

415  18JT9R6EKV   2014-08-28   2014-09-06              1
26   18JT9R6EKV   2014-12-08   2014-12-11              2
374  18JT9R6EKV   2015-03-03   2015-03-09              1
139  1PEGXAVCN5   2014-05-06   2014-05-10              3
969  1PEGXAVCN5   2014-06-11   2014-06-15              2
649  1PEGXAVCN5   2014-08-12   2014-08-16              3

I want to perform a loop that would check every row against the preceding row, and given certain conditions assign them the same number (so I can group them later) , the point is that if the date segments are close enough I would collapse them into one segment.

my attempt is the following :

i <- 1
z <- 1
v <- 1
for (i in 2:nrow(df)){
   z[i] <-  ifelse(df[i,'char_column'] == df[i-1,'char_column'],
              ifelse((df[i,'date_column1'] - df[i-1,'date_column2']) <= 5, 
                     ifelse(df[i,'integer_column'] == df[i-1,'integer_column'], 
                            v, v<- v+1),
                     v <- v+1),
              v <- v+1)}

df$grouping <- z

then I would just group using min(date_column1) and max(date_column2).

this method works perfectly for say 100,000 rows (22.86 seconds) but for a million rows : 33.18 minutes!! I have over 60m rows to process, is there a way I can make the process more efficient ?

PS: to generate a similar table you can use the following code :

x <- NULL
for (i in 1:200) { x[i] <- paste(sample(c(LETTERS, 1:9), 10), collapse =   '')}
y <- sample((as.Date('2014-01-01')):as.Date('2015-05-01'), 1000, replace = T)
y2 <- y + sample(1:10)
df <- data.frame(char_column = sample(x, 1000, rep = T),
             date_column1 = as.Date(y, origin = '1970-01-01'),
             date_column2 = as.Date(y2,origin = '1970-01-01'),
             integer_column = sample(1:3,1000, replace = T),
             row.names = NULL)

df <- df[order(df$char_column, df$date_column1),]
Mouad_Seridi
  • 2,666
  • 15
  • 27
  • 4
    A quick fix for some speed gains is to preallocate your storage vector, z. Add `z <- numeric(nrow(df)); z[1] <- 1`. – lmo Dec 12 '16 at 16:33
  • When creating an example with a random process please use `set.seed` to get consistent results in solutions. – Pierre L Dec 12 '16 at 16:33
  • Could you provide an example of the desired output? – Jason Morgan Dec 12 '16 at 16:41
  • 1
    @JasonMorgan Since this is an efficiency question, the desired output is equal to the output of the original script. Hopefully, just faster. – Pierre L Dec 12 '16 at 16:43
  • @PierreLafortune That's all fine, but sometimes these questions have straightforward solutions that can be answered without running any code. Having the desired output would make it much easier to determine whether this is the case. What is someone is answering from their phone? – Jason Morgan Dec 12 '16 at 16:46
  • seems like `df$grouping <- data.table::rleid(df$char_column, df$date_column1-df$date_column2<=5, df$integer_column)` – Khashaa Dec 12 '16 at 17:59
  • @lmo thanks for the tip, it speeds things up considerably once the number of rows increases to 100K from 22 seconds to just over 6s, – Mouad_Seridi Dec 12 '16 at 18:02
  • @Jason the desired output would be something like z = 1,2,3,3,3,3,4,5,..... with the length equal to the number of rows of my data frame. – Mouad_Seridi Dec 12 '16 at 18:02
  • , @Pierre : thanks for the tip. – Mouad_Seridi Dec 12 '16 at 18:02
  • subsetting data.frame is often a major bottleneck: assign them vectors `char_col1 <- df[, 'char_column']; date_col1 <- df[, 'date_column1']; date_col2 <- df[, 'date_column2']` and use vectors in code. See for instance, [here](http://stackoverflow.com/a/34826252/4380497) – Khashaa Dec 12 '16 at 18:08
  • @Khashaa does your data.table answer takes into effect that i'm checking one row against the one preceding it ? in my code the condition is `df[i,'date_column1'] - df[i-1,'date_column2'] <= 5` – Mouad_Seridi Dec 12 '16 at 18:12
  • 1
    @user4970610 data.table has `shift` function `df$grouping <- rleid(df$char_column, df$date_column1-shift(df$date_column2)<=5, df$integer_column==shift(df$integer_column))` maybe? – Khashaa Dec 12 '16 at 18:18
  • @Khashaa yeah that doesn't work ..... – Mouad_Seridi Dec 12 '16 at 19:22
  • 1
    @user4970610 posted a solution that might suit your needs – Khashaa Dec 12 '16 at 20:25

1 Answers1

3

Since data.table::rleid does not work, I post another (hopefully) fast solution

1. Get rid of nested ifelse

ifelse is often slow, especially for scalar evaluation, use if.

Nested ifelse should be avoided whenever possible: observe that ifelse(A, ifelse(B, x, y), y) can be suitably replaced by if (A&B) x else y

f1 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col <- df[, 'integer_column']
  for (i in 2:nrow(df)){
    if((char_col[i] == char_col[i-1])&((date_col1[i] - date_col2[i-1]) <= 5)&(int_col[i] == int_col[i-1]))
    {
      z[i] <- z[i-1]
    }
    else 
    {
      z[i] <- z[i-1]+1
    }
  }
  z
}

f1 is about 40% faster than the original solution for 10.000 rows.

system.time(f1(df))
   user  system elapsed 
   2.72    0.00    2.79 

2. Vectorize

Upon closer inspection the conditions inside if can be vectorized

library(data.table)
f2 <- function(df){
  z <- rep(NA, nrow(df))
  z[1] <- 1
  char_col <- df[, 'char_column']
  date_col1 <- df[, 'date_column1']
  date_col2 <- df[, 'date_column2']
  int_col <- df[, 'integer_column']
  cond <- (char_col==shift(char_col))&(date_col1 - shift(date_col2) <= 5)&(int_col==shift(int_col))
  for (i in 2:nrow(df)){
    if(cond[i])
    {
      z[i] <- z[i-1]
    }
    else 
    {
      z[i] <- z[i-1]+1
    }
  }
  z
} 
# for 10000 rows
system.time(f2(df))
#   user  system elapsed 
#   0.01    0.00    0.02 

3. Vectorize, Vectorize

While f2 is already quite fast, a further vectorization is possible. Observe how z is calculated: cond is a logical vector, and z[i] = z[i-1] + 1 when cond is FALSE. This is none other than cumsum(!cond).

f3 <- function(df){
  setDT(df)
  df[, cond := (char_column==shift(char_column))&(date_column1 - shift(date_column2) <= 5)&(integer_column==shift(integer_column)),]
  df[, group := cumsum(!c(FALSE, cond[-1L])),]
} 

For 1M rows

system.time(f3(df))
#   user  system elapsed 
#   0.05    0.05    0.09 
system.time(f2(df))
#   user  system elapsed 
#   1.83    0.05    1.87 
Khashaa
  • 7,293
  • 2
  • 21
  • 37