4

I have a table as following:

id   name  amount  year
001  A     10      2010
001  A     10      2011
001  A     12      2012
-----------------------
002  A     3       2012
002  A     4       2013
-----------------------
003  B     20      2011
003  B     20      2012

(Note two entities have the same name A but they are different, id is the unique identifier.)

I want to compute the increment in amount over the previous year, i.e. the result should look like:

id   name   increment   year
001  A      0           2010
001  A      0           2011
001  A      2           2012
----------------------------
002  A      0           2012
002  A      1           2013
----------------------------
003  B      0           2011
003  B      0           2012

Note that the increment of the first year is considered "0".


In MSSQL, it can implemented by:

SELECT id,
       name,
       amount - LAG(amount, 1, amount) OVER (PARTITION BY id ORDER BY YEAR) as increment,
       year
FROM table

I am trying to accomplish the task in R with data.table. I found an succinct example here:
DT[, increment := amount - shift(amount, 1), by=id]. But error was prompted: could not find function "shift".

The versions are:

  • R: 3.2.0_1
  • data.table: 1.9.4

The questions are:

  1. I found the shift function implemented on data.table's Github, why I failed to invoke the function?
  2. I think that by in data.table is equivalent to PARTITION BY in SQL, then what is the counterpart of ORDER BY in R? Do I have to set the key of data.table before carrying out any aggregation so the data.table is ordered?
Community
  • 1
  • 1
Zelong
  • 2,476
  • 7
  • 31
  • 51
  • 1
    `DT[, increment := c(0L, diff(amount)), by = id]`. If you want to use `shift` you need to install the development version from GH, see [here](https://github.com/Rdatatable/data.table/wiki/Installation) – David Arenburg Jun 04 '15 at 10:28
  • Using `shift`, it would be something like `DT[, increment := amount - shift(amount, 1L, fill = amount[1L]), by = id]` – David Arenburg Jun 04 '15 at 10:34
  • See this question with five different methods http://stackoverflow.com/questions/30579699/collapse-dataframe-by-name-where-integer-columns-are-summed-and-factor-columns/30580426#30580426 – Pierre L Jun 04 '15 at 10:41
  • 1
    @plafort how is that link related? – David Arenburg Jun 04 '15 at 10:42
  • It answers a problem related to 'how do I do a calculation on a column grouped by this variable'? – Pierre L Jun 04 '15 at 10:46
  • @plafort OP posted a much more related link in his question. – David Arenburg Jun 04 '15 at 10:47
  • @DavidArenburg got it, I didn't see the link. In my view, this is essentially a split-apply-combine task.The question I linked to shows methods of approaching them that could solve this problem and others like it. – Pierre L Jun 04 '15 at 11:12
  • @DavidArenburg I added an answer, please advise or suggest if I am off with my approach. I learned most of the techniques from your posts. – Pierre L Jun 04 '15 at 13:46
  • @plafort I think that rather than `aggregate`, `tapply` and `summarise` OP is looking for `:=` (in `data.table`), `mutate` (in `dplyr`) and `ave` (in base R). – David Arenburg Jun 04 '15 at 13:51
  • Thank you! I add more. I like the expressions that replace the column as opposed to just adding a new one like `mutate` and `:=`. So many options though : ) @DavidArenburg – Pierre L Jun 04 '15 at 14:03
  • @plafort I'm not sure what you mean. All your options added new columns. In `data.table` we are adding new columns using `:=` (see my first two comments). What you did is copying all the columns within the data *and* created a new column. I'm not sure how is it better. – David Arenburg Jun 04 '15 at 14:10
  • I was referring to the finished product having the same number of columns as the original, which is the desired output of the OP. @DavidArenburg – Pierre L Jun 04 '15 at 14:24
  • @OP #1 You should upgrade to 1.9.5 (instructions are on the github page David linked in the first comment) to use `shift`; #2 It is not necessary to set a key before using `by`. If you want the results sorted by the by column, use `keyby=id` in its place. If at some point you want to reorder the table without doing anything else, you can use `setorder`. – Frank Jun 04 '15 at 17:37
  • @plafort Your `fun` is not as general as `shift`. I think it makes the most sense simply to upgrade to 1.9.5 so they have access to `shift`, which they already understand how to use. – Frank Jun 04 '15 at 17:38
  • @DavidArenburg I see, thank you very much. – Zelong Jun 04 '15 at 20:55
  • @Frank Thank you. I read the manual of data.table. It says that "`keyby=` is as `by=` then `setkey`", so the `setkey` is applied to the result afterwards. Assume that the `id` and `year` are unordered, do I need to sort the `id` and `year` to ensure the subtraction is based on previous year, before I generate the new column `increment`? – Zelong Jun 04 '15 at 20:59
  • 1
    Yes, sorting year is necessary to ensure the previous entry corresponds to the previous year. If you upgrade to get the shift function, something like this should work: `DT[order(year), increment := amount - shift(amount, 1), by=id]` which is almost the same as the code you have posted. If you don't upgrade (for example, I can't at work), David's answer in the first comment should work with `order` inserted as well. – Frank Jun 04 '15 at 21:04

1 Answers1

4

This case falls under a general structure of doing an operation on a column by a separate grouping column.

fun <- function(v) c(0, diff(v)) #to take the difference and account for the starting value

#function tapply()
df1 <- df
df1$amount <- unlist(with(df, by(amount, id, fun)))
df1
   id name amount year
1 001    A      0 2010
2 001    A      0 2011
3 001    A      2 2012
4 002    A      0 2012
5 002    A      1 2013
6 003    B      0 2011
7 003    B      0 2012

#using data.table
df2 <- df
setDT(df2)[, list(name, Increment = fun(amount), year), by = id]
    id name Increment year
1: 001    A         0 2010
2: 001    A         0 2011
3: 001    A         2 2012
4: 002    A         0 2012
5: 002    A         1 2013
6: 003    B         0 2011
7: 003    B         0 2012

#function: by()
df3 <- df
df3$amount <- unlist(with(df3, by(amount, id, fun)))
df3
   id name amount year
1 001    A      0 2010
2 001    A      0 2011
3 001    A      2 2012
4 002    A      0 2012
5 002    A      1 2013
6 003    B      0 2011
7 003    B      0 2012

#using dplyr with data.table
DT %>%
  group_by(id) %>%
  summarise(name, increment = fun(amount), year)
Source: local data table [7 x 4]

   id name increment year
1 001    A         0 2010
2 001    A         0 2011
3 001    A         2 2012
4 002    A         0 2012
5 002    A         1 2013
6 003    B         0 2011
7 003    B         0 2012

#using aggregate
df5$amount <- unlist(aggregate(amount ~ id, data=df5, FUN=fun)$amount)
df5
   id name amount year
1 001    A      0 2010
2 001    A      0 2011
3 001    A      2 2012
4 002    A      0 2012
5 002    A      1 2013
6 003    B      0 2011
7 003    B      0 2012

#function: ave
df6 <- df
df6$amount <- with(df, ave(amount, id, FUN-fun))
df6
   id name amount year
1 001    A      0 2010
2 001    A      0 2011
3 001    A      2 2012
4 002    A      0 2012
5 002    A      1 2013
6 003    B      0 2011
7 003    B      0 2012

#dplyr (non-data.table)
df7 <- df
df %>%
  group_by(id) %>%
  mutate(increment = fun(amount))
   id name amount year increment
1 001    A     10 2010         0
2 001    A     10 2011         0
3 001    A     12 2012         2
4 002    A      3 2012         0
5 002    A      4 2013         1
6 003    B     20 2011         0
7 003    B     20 2012         0

#dplyr (with extra command 'select' to give the desired output of the OP)
df %>%
   group_by(id) %>%
     mutate(increment = fun(amount)) %>%
       select(id, name, increment, year)
Source: local data frame [7 x 4]
Groups: id

   id name increment year
1 001    A         0 2010
2 001    A         0 2011
3 001    A         2 2012
4 002    A         0 2012
5 002    A         1 2013
6 003    B         0 2011
7 003    B         0 2012

Data

df <- data.frame(id=factor(c('001', '001', '001', '002', '002', '003', '003')), 
                 name=c(rep('A', 5), rep('B', 2)),
                 amount=c(10,10,12,3,4,20,20),
                 year=c(2010, 2011, 2012, 2012, 2013, 2011, 2012)
)
Pierre L
  • 28,203
  • 6
  • 47
  • 69