6

I want to calculate how long its been since something occurred.

Given the following, you can see that the light is on some of the time, but not all of the time. I want to normalize the data to feed it to a neural network.

library(data.table)
d<-data.table(
    date = c("6/1/2013", "6/2/2013","6/3/2013","6/4/2013"),
    light = c(TRUE,FALSE,FALSE,TRUE) 
)
d
       date light
1: 6/1/2013  TRUE
2: 6/2/2013 FALSE
3: 6/3/2013 FALSE
4: 6/4/2013  TRUE

what I'd like to calculate is another column that shows the "distance" to the last occurrence.

so for the data above: first row, since its on it should be zero second row, should be 1 third row, should be 2 fourth row, should be zero

Cœur
  • 37,241
  • 25
  • 195
  • 267
eAndy
  • 323
  • 2
  • 9

3 Answers3

5

I would suggest creating a grouping column based on when there is a switch from FALSE to TRUE:

# create group column
d[c(light), group := cumsum(light)]
d[is.na(group), group:=0L]
d[, group := cumsum(group)]
d

Then simply tally by group, using cumsum and negating light:

d[, distance := cumsum(!light), by=group]

# remove the group column for cleanliness
d[, group := NULL]

Results:

d

         date light distance
1: 2013-06-01  TRUE        0
2: 2013-06-02 FALSE        1
3: 2013-06-03 FALSE        2
4: 2013-06-04  TRUE        0
5: 2013-06-05  TRUE        0
6: 2013-06-06 FALSE        1
7: 2013-06-07 FALSE        2
8: 2013-06-08  TRUE        0

I added a few rows

Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • why the recommendation over the other seemingly simpler solutions? Is it a perf consideration? More correct in some manner? – eAndy Jul 10 '13 at 01:23
  • 1
    nope, just how it occurred to me to approach the problem. Personally, I like @eddi's solutions – Ricardo Saporta Jul 10 '13 at 01:49
4

This should do it:

d[, distance := 1:.N - 1, by = cumsum(light)]

or this:

d[, distance := .I - .I[1], by = cumsum(light)]

And if you want to actually count number of days as opposed to row-distance, you could use:

d[, distance := as.numeric(as.POSIXct(date, format = "%m/%d/%Y") -
                           as.POSIXct(date[1], format = "%m/%d/%Y"),
                           units = 'days'),
    by = cumsum(light)]
eddi
  • 49,088
  • 6
  • 104
  • 155
2

An approach using run length encoding (rle) and sequence (which is a wrapper for unlist(lapply(nvec, seq_len))

d[, distance := sequence(rle(light)$lengths)][(light), distance := 0]
mnel
  • 113,303
  • 27
  • 265
  • 254