3

I'm looking to get the difference in counts by version by app_name. My dataset looks like this: app_name, version_id, count, [difference]

Here is the dataset

    data = structure(list(app_name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), version_id = c(1, 
1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4), count = c(600L, 620L, 620L, 
200L, 200L, 250L, 250L, 15L, 36L)), .Names = c("app_name", "version_id", 
"count"), class = "data.frame", row.names = c(NA, -9L))

Given this data.frame, how can I get the lagged difference in count by both app_name & version_id? the initial (first) version diff for each app would be zero, since there would be no difference.

Here is an example of what the final results would look like with that final 'diff' column

structure(list(app_name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), version_id = c(1, 
1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4), count = c(600L, 620L, 620L, 
200L, 200L, 250L, 250L, 15L, 36L), diff = c(0, 20, 0, 0, 0, 1.25, 
0, 0, 2.4)), .Names = c("app_name", "version_id", "count", "diff"
), class = "data.frame", row.names = c(NA, -9L))
Super_John
  • 1,767
  • 2
  • 14
  • 27
  • What did you try so far? –  Oct 10 '15 at 00:55
  • 1
    @Pascal I've been trying to use mutate() to no avail. Following this thread: http://stackoverflow.com/questions/31362397/calculating-the-difference-between-rows-in-a-dataframe-using-dplyr – Super_John Oct 10 '15 at 01:13

2 Answers2

5

Try using dplyr and lag:

library(dplyr)
data %>% group_by(app_name) %>%
         mutate(diffvers = version_id - dplyr::lag(version_id, default = version_id[1]),
                diffcount = count - dplyr::lag(count, default = count[1]))

Source: local data frame [9 x 5]
Groups: app_name [3]

  app_name version_id count diffvers diffcount
    (fctr)      (dbl) (int)    (dbl)     (int)
1        a        1.0   600      0.0         0
2        a        1.1   620      0.1        20
3        a        2.3   620      1.2         0
4        b        2.0   200      0.0         0
5        b        3.1   200      1.1         0
6        b        3.3   250      0.2        50
7        b        4.0   250      0.7         0
8        c        1.1    15      0.0         0
9        c        2.4    36      1.3        21
jeremycg
  • 24,657
  • 5
  • 63
  • 74
0

We could use data.table. We convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'app_name', loop (lapply(..) the columns specified in the .SDcols, get the difference between the current element and its lag (shift by default has type='lag') and assign (:=) the output to create new columns.

library(data.table)#v1.9.6
setDT(data)[, c('diffvers', 'diffcount') := lapply(.SD, 
              function(x) x-shift(x, fill=x[1L])), by = app_name, .SDcols=2:3]

data
#   app_name version_id count diffvers diffcount
#1:        a        1.0   600      0.0         0
#2:        a        1.1   620      0.1        20
#3:        a        2.3   620      1.2         0
#4:        b        2.0   200      0.0         0
#5:        b        3.1   200      1.1         0
#6:        b        3.3   250      0.2        50
#7:        b        4.0   250      0.7         0
#8:        c        1.1    15      0.0         0
#9:        c        2.4    36      1.3        21
akrun
  • 874,273
  • 37
  • 540
  • 662