Good morning Stack Overflow,

I have seen answers to similar questions; however, they do not take group_ID into account and are not efficient enough to run on massive datasets.

I am struggling to find a solution to the following task: within each group_ID, recursively compute the difference between each element and the one before it, from the second element through the last element of that group.

Therefore, considering the following sample data:

data <- data.frame(time = c(1:3, 1:4),
                   group_ID = c(rep(c("1", "2"), c(3, 4))),
                   value = c(0, 400, 2000, 0, 500, 2000, 2120))

The expected result of the solution I am trying to find is:

solution_df <- data.frame(time = c(1:3, 1:4),
                          group_ID = c(rep(c("1", "2"), c(3, 4))),
                          difference = c(NA, 400, 1600, NA, 500, 1500, 120))

It is critical to bear in mind the dataset is massive and the solution must be efficient.

I hope the question is clear; otherwise, please ask for further details.

Seymour

1 Answer

You could use data.table for grouping and diff to calculate the differences.

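library(data.table)
setDT(data)  # convert the data.frame to a data.table by reference
# Per group: keep time, and prepend NA to diff(value) so the lengths match
data[, .(time = time,
         difference = c(NA, diff(value))), by = group_ID]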

#   group_ID time difference
#1:        1    1         NA
#2:        1    2        400
#3:        1    3       1600
#4:        2    1         NA
#5:        2    2        500
#6:        2    3       1500
#7:        2    4        120
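
A variant of the same idea, as a minimal sketch: data.table's `shift()` lags a column within each group, and `:=` adds the result by reference instead of building a new table. Whether this is actually faster on the real dataset is an assumption that would need benchmarking.

```r
library(data.table)

# Sample data from the question
data <- data.table(time = c(1:3, 1:4),
                   group_ID = rep(c("1", "2"), c(3, 4)),
                   value = c(0, 400, 2000, 0, 500, 2000, 2120))

# shift() lags `value` by one row within each group, so the first row
# of every group is NA; := adds the column in place without copying
data[, difference := value - shift(value), by = group_ID]
```

Unlike the call above, this keeps the original `value` column and modifies `data` itself.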

I don't know what is supposed to be recursive here.

Roland
  • Thank you. By "recursive" I meant for each group. What would the correct term be? – Seymour Dec 07 '17 at 11:38
  • Following your data.table solution, can I replace `diff()` with `distGeo()`, which takes the latitude and longitude of two adjacent records and returns the distance between them? – Seymour Dec 07 '17 at 11:42
  • Probably. But then your speed bottleneck might be that function (which I don't know). Have you done some profiling? – Roland Dec 07 '17 at 11:46
  • Apparently it is not possible to use `distGeo()` in place of `diff()`, because the former takes two arguments whereas `diff()` computes the difference between records by itself. – Seymour Dec 07 '17 at 11:49
  • If you have a different question please ask a different question. I believe my answer fully addresses your question as asked. – Roland Dec 07 '17 at 11:53
  • You definitely did. I was just trying to understand whether the same solution could also be used to solve another question :)
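
On the two-argument point raised in the comments, one possible sketch is to build the "previous record" explicitly with `shift()` and pass both the previous and the current values to the function. The data, the column names `lon` and `lat`, and the helper `pair_dist()` below are all hypothetical; `pair_dist()` is a plain Euclidean stand-in, not `distGeo()`.

```r
library(data.table)

# Hypothetical coordinates; the column names lon/lat are assumptions
pts <- data.table(group_ID = rep(c("1", "2"), c(3, 2)),
                  lon = c(0, 3, 3, 10, 10),
                  lat = c(0, 4, 4, 0, 5))

# Hypothetical two-argument-style stand-in for a distance function:
# planar Euclidean distance between (x1, y1) and (x2, y2)
pair_dist <- function(x1, y1, x2, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)

# Pair each row with the previous row of the same group via shift();
# the first row of each group has no predecessor and yields NA
pts[, dist := pair_dist(shift(lon), shift(lat), lon, lat), by = group_ID]
```

Swapping in `distGeo()` would presumably mean passing two-column longitude/latitude matrices for the previous and the current rows; that part is an assumption and is not tested here.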