
Here is a sample of my data frame:

    Year Revenue
    2001  1.23
    2002 23.4
    2003 12.4
    2004 18.0
    ...

I want to calculate running statistics such as year-over-year (YoY) growth. For 2002, for example, this would be Revenue[2002] - Revenue[2001].

I can do this with a for loop, but is there a base function, or anything in plyr, that accomplishes this more elegantly?

Frank
QuantQandA

1 Answer

As suggested, `diff` will do what you are looking for. If your dataset is large or contains groups, you can try dplyr.

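    library(dplyr)

    # Recreate the sample data from the question
    dat <- read.table(header = TRUE, text = "Year Revenue
    2001  1.23
    2002 23.4
    2003 12.4
    2004 18.0")

    # lag() shifts Revenue down one row, so each row subtracts the previous year
    mutate(dat, yoy = Revenue - lag(Revenue))

      Year Revenue    yoy
    1 2001    1.23     NA
    2 2002   23.40  22.17
    3 2003   12.40 -11.00
    4 2004   18.00   5.60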
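For the grouped case mentioned above, here is a minimal sketch. It assumes a hypothetical `Company` column and made-up revenue figures, neither of which is in the original data. Grouping before `mutate` restarts the lag within each group, so the first year of every company gets `NA` instead of a difference against another company's last year.

```r
library(dplyr)

# Hypothetical data: the Company column and the values for B are made up
dat_grouped <- read.table(header = TRUE, text = "Company Year Revenue
A 2001  1.23
A 2002 23.4
A 2003 12.4
B 2001 10.0
B 2002 12.5
B 2003 11.0")

yoy_by_company <- dat_grouped %>%
  arrange(Company, Year) %>%                        # make sure years are in order within each company
  group_by(Company) %>%                             # lag() is now applied per company
  mutate(yoy = Revenue - dplyr::lag(Revenue)) %>%   # dplyr::lag, not stats::lag
  ungroup()

print(yoy_by_company)
```

The explicit `dplyr::lag` is only a precaution: `stats::lag` has the same name but works on time series objects and would not shift a plain numeric vector.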

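Edit, in reply to eddi's comment: there also seem to be some differences in how the data is copied. In the output of dplyr's `changes()` below, `mutate` only adds the new `yoy` column, whereas `within` also gives the existing `Year` and `Revenue` columns new memory addresses, meaning they were copied.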

    > dplyr_dat <- mutate(dat, yoy = Revenue - lag(Revenue))
    > dplyr::changes(dat, dplyr_dat)
    Changed variables:
              old         new
    yoy                   0x10d951400

    Changed attributes:
              old         new
    names     0x10c3161b8 0x10deeb128
    class     0x101ca6568 0x103668108
    row.names 0x10c233f88 0x100c98a68
    > diff_dat <- within(dat, yoy <- c(NA, diff(Revenue)))
    > dplyr::changes(dat, diff_dat)
    Changed variables:
              old         new
    Year      0x10c316180 0x11086b9f0
    Revenue   0x1036b2120 0x1070c0f28
    yoy                   0x110118a40

    Changed attributes:
              old         new
    names     0x10c3161b8 0x10c310ff8
    class     0x101ca6568 0x10f4ce7a8
    row.names 0x10c1d6a38 0x10f7dca78
Vincent
  • Would `mutate` perform faster than `diff` on large datasets? FYI @user3273226, the `diff` approach could look like this: `within(dat, yoy <- c(NA, diff(Revenue)))`. – jbaums Feb 05 '14 at 02:07
  • @jbaums - a microbenchmark of `diff` vs. `mutate` gives 30ms vs. 15ms for a 400K-length vector and 178ms vs. 69ms for a 2M-length vector. Not earth-shattering differences for day-to-day operation, but slightly quicker. – thelatemail Feb 05 '14 at 02:44
  • @thelatemail thanks for running the test. I'll put `mutate` on my list of things to remember. – jbaums Feb 05 '14 at 02:48
  • @thelatemail I'm guessing most (all?) of that difference comes from `diff` being much slower than `lag`. – eddi Feb 05 '14 at 17:06
  • @eddi yes, and we haven't implemented `lag()` in C++ yet. Avoiding copies also saves a decent amount of time if you have tens of millions of rows. – hadley Feb 06 '14 at 01:22
  • BTW `changes()` is exported so you don't need `:::` – hadley Feb 06 '14 at 01:23