3

My goal is to do the following:

I have a data.table with > 1 million rows and 4 columns. I want to add a fifth column carrying either a '1' or a '0', depending on the following condition: If there is zero in column 3 followed by any non-zero value in the next row of column 3 (like a sliding window of length 2), a '1' should be added to column 5, otherwise a '0'. The row position of this binary in column 5 should be the same as the non-zero value. Moreover, I want to do this based on a key (i.e. subsets of a column), as the code below shows.

I was unable to implement this in data.table (I have just started using it). I wrote the following code, which does what I want, but it's incredibly slow (>5 s/iteration):

# all_movement_fit_tidy is a data.table with dim 1.2 mio x 4
# set key "fly" in all_movement_fit_tidy
setkey(all_movement_fit_tidy, fly)
genotypes <- unique(all_movement_fit_tidy$fly)
DT_list <- list()
i <- 1
for(genotype in genotypes){

# Get subset of data.table by genotype
genotype_movement <- all_movement_fit_tidy[genotype]

# Calculate initiation list - figure out how to do this in data.table
initiation <- list()
for(j in 1:(nrow(genotype_movement) - 1)){

  initiation[j] <- ifelse(genotype_movement[j, "mm"] == 0 & 
  genotype_movement[j + 1, "mm"] != 0, 1, 0)
}

# Add a 0 to the end of the list to equal number of rows,
# then bind it to genotype_movement
initiation[length(initiation) + 1] <- 0
genotype_movement <- cbind(genotype_movement, as.numeric(initiation))
colnames(genotype_movement)[5] <- "initiation"

# Save this data table in a list of data.tables (DT_list)
DT_list[[i]] <- genotype_movement

i <- i + 1
}

After this, I would bind all entries of the DT_list, which carries all data.tables, into one data.table. The reason why this code is so slow is the following section:

initiation <- list()
for(j in 1:(nrow(genotype_movement) - 1)){

 initiation[j] <- ifelse(genotype_movement[j, "mm"] == 0 & 
 genotype_movement[j + 1, "mm"] != 0, 1, 0)
}

Here, I am looping through each row of the subset of the data.table, and assign the result of the ifelse() function to a list. How can I do this inside the j argument of data.table? I have tried something like:

genotype_movement[, initiation := function(x) for(i in 
1:nrow(genotype_movement) ifelse(.SD[i, "mm"] == 0 & .SD[i + 1, "mm" != 
0, 1, 0)]

But this did not work because the ifelse() function returns single values which cannot be assigned to the initiation column.

pat_krat
  • 179
  • 4
  • 3
    `DT[, col5:= shift(col3) ==0 & col3 != 0]` ? – Hugh Feb 21 '18 at 14:06
  • 1
    I didn't know about shift, this is great, thanks, it works. The code I wrote to achieve this is clearly ridiculous. – pat_krat Feb 21 '18 at 14:21
  • Fyi, there are possibly other ways to tackle this problem, but they're too cumbersome to investigate since you did not provide a reproducible example. For your next question, maybe review http://stackoverflow.com/a/28481250/ – Frank Feb 21 '18 at 15:07
  • 1
    @Frank Thank you, very useful comment, I will take it to heart. – pat_krat Feb 21 '18 at 16:29

1 Answers1

4

You've gone down a rabbit hole. Here's the way up:

DT[, col5:= shift(col3, fill = -1) == 0 & col3 != 0]

# or
DT[, col5:= shift(col3, fill = -1) == 0 & col3 != 0, keyby = key(DT)]
Hugh
  • 15,521
  • 12
  • 57
  • 100
  • Perfect, thank you for a quick and precise solution. – pat_krat Feb 21 '18 at 14:29
  • Just in case anyone visits this page for the same problem: Once Hugh's solution is implemented, if you use keyby = key(DT), NA's will be introduced at the beginning of each column subset that satisfies col3 != 0, because shift() cannot lag behind the beginning of the column. However, if col3 != 0 is FALSE, no NA will be introduced, but FALSE instead, because presumably, shift() doesn't execute in that case. Just to be aware of this difference. – pat_krat Feb 21 '18 at 14:48
  • Hey @pat_krat, you may want to consider using the `fill` argument in my edit to `shift` to avoid any values in the first entry for each key`. – Hugh Feb 21 '18 at 14:50