2

I have a data.table object with 5M rows. It may look like this:

csvdata <- data.table(timestamp = c(1:6),
                      signal.a=c(12, 12, 13, 12, 12, 14),
                      signal.b=c(7, 7, 7, 7, 8, 8))

timestamp  signal.a  signal.b
        1        12         7
        2        12         7
        3        13         7
        4        12         7
        5        12         8
        6        14         8

What I am trying to do is to remove every row in the table, which does not register any signal change. So I would like to end up with this: Row 2 is deleted, because neither signal.a nor signal.b changed.

timestamp  signal.a  signal.b
        1        12         7
        3        13         7
        4        12         7
        5        12         8
        6        14         8

I have little experience in R, so I tried the usual approach of a for-loop with the intention to mark each row for deletion and later filter out the rows I would like to keep:

for (i in 1:nrow(csvdata)) {
    if (i > 1 && csvdata[i]$signal.a == csvdata[i-1]$signal.a &&
       csvdata[i]$signal.b == csvdata[i-1]$signal.b) {
        csvdata[i]$Drop <- 1
    }
}

The code seems to work, but with 5M rows this code takes forever to run (2h and counting). Is there a more efficient solution?

cddbldot
  • 100
  • 8
  • I can not simply remove duplicates. If the signals may return to 12 and 7 let's say on timestamp 7, I want to keep that row. – cddbldot Jan 24 '17 at 11:45

2 Answers2

3
csvdata[csvdata[, c(TRUE, #always keep first row
                    Reduce("|", #at least one column needs to change
                            Map("!=", 
                                shift(.(signal.a, signal.b)), #shift the columns by one row
                                .(signal.a, signal.b)) #and compare with preceding row
                            )[-1])] #always keep first row
        ,]
#   timestamp signal.a signal.b
#1:         1       12        7
#2:         3       13        7
#3:         4       12        7
#4:         5       12        8
#5:         6       14        8
Roland
  • 127,288
  • 10
  • 191
  • 288
3

One common way to approach this is with rleid, like either (thanks to @Arun's comment) ...

csvdata[!duplicated(rleid(signal.a, signal.b))]

or using @eddi's approach...

csvdata[ csvdata[, .I[1L], by=rleid(signal.a, signal.b)]$V1 ]
Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Wow, very simple. You can even add or remove columns as you like, which is not possible with Roland's approach (which still works fine for me as well). – cddbldot Jan 24 '17 at 14:42
  • @cddbldot I don't understand. What is not possible with my approach? – Roland Jan 25 '17 at 06:58
  • @Roland You might add a `signal.c` to the equation or drop `signal.b` and get the error `x arguments passed to '|' which requires 2`. I didn't need that to solve my problem and I believe with a little more R knowledge I could alter your code as well to do just that. Here you just have to add the column to the list and you are done, though, so it's more universial (or at least intuitive). – cddbldot Jan 25 '17 at 07:29
  • @cddbldot You can easily add columns with my code. If you only consider one column you just need to slightly simplify the code, i.e., remove the call to `|`. So, of course it's possible with my approach (and even easy if you understand the approach). You should not be using code you don't understand. – Roland Jan 25 '17 at 07:39
  • 1
    @Roland your way generalizes if `Reduce(\`|\`, z, init = FALSE)` is used instead of `do.call`? – Frank Jan 25 '17 at 14:47