
I am an inexperienced R user and have been struggling with the `by()` function and would appreciate your help. The task is simple: I have a longitudinal dataset (how do I declare one?) and need to calculate different metrics by ID. One of the metrics is a simple percent change, which requires a lag (example below):

 ID         Date          Temp        %Change
 AAA    1/1/2003    0.749881714 NA
 AAA    1/2/2003    0.666661576 -0.110977687
 AAA    1/3/2003    0.773079935 0.159628759
 AAA    1/4/2003    0.62902364  -0.186340751
 AAA    1/5/2003    0.733312374 0.165794619
 BBB    1/1/2003    0.707339766 NA
 BBB    1/2/2003    0.764986529 0.081497982
 BBB    1/3/2003    0.662201467 -0.134361925
 BBB    1/4/2003    0.774451765 0.169510798  
 BBB    1/5/2003    0.50829093  -0.343676453
 CCC    1/1/2003    0.836836215 NA
 CCC    1/2/2003    0.837136823 0.00035922
 CCC    1/3/2003    0.809016624 -0.033590924
 CCC    1/4/2003    0.690277509 -0.146769685
 CCC    1/5/2003    0.796357069 0.153676686

Intuitively I understand the use of `by()`, but I haven't been able to produce the correct result (%Change) using a data.frame that contains $ID, $Date, and $Temp. Any suggestions on how to achieve the desired %Change would be greatly appreciated.
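For anyone who wants to follow along, here is a sketch that rebuilds the sample data above as a data.frame named `dat` (the name the answers below assume); the values are copied from the table:

```r
# Rebuild the sample data from the table above
dat <- data.frame(
  ID   = rep(c("AAA", "BBB", "CCC"), each = 5),
  Date = rep(c("1/1/2003", "1/2/2003", "1/3/2003", "1/4/2003", "1/5/2003"),
             times = 3),
  Temp = c(0.749881714, 0.666661576, 0.773079935, 0.62902364,  0.733312374,
           0.707339766, 0.764986529, 0.662201467, 0.774451765, 0.50829093,
           0.836836215, 0.837136823, 0.809016624, 0.690277509, 0.796357069),
  stringsAsFactors = TRUE  # keep ID as a factor, as older R did by default
)
str(dat)  # 'data.frame': 15 obs. of 3 variables
```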

Mateo
  • Welcome to SO, Mateo. This is a really nice question with good sample data. – Andrie Jun 28 '11 at 19:38
  • I'm guessing that was meant to ask about the `by` function. Correct capitalization (or "not" in this case) is crucial in R. – IRTFM Jun 15 '17 at 20:17

3 Answers


You can use plyr to handle the group-by operation on ID, and quantmod, which has a function for the percentage change named `Delt`.

require(plyr)
require(quantmod)

> ddply(dat, "ID", transform,  DeltaCol = Delt(Temp))
    ID     Date      Temp    X.Change Delt.1.arithmetic
1  AAA 1/1/2003 0.7498817          NA                NA
2  AAA 1/2/2003 0.6666616 -0.11097769     -0.1109776868
3  AAA 1/3/2003 0.7730799  0.15962876      0.1596287574
4  AAA 1/4/2003 0.6290236 -0.18634075     -0.1863407501
5  AAA 1/5/2003 0.7333124  0.16579462      0.1657946178
6  BBB 1/1/2003 0.7073398          NA                NA
7  BBB 1/2/2003 0.7649865  0.08149798      0.0814979813
8  BBB 1/3/2003 0.6622015 -0.13436192     -0.1343619242
9  BBB 1/4/2003 0.7744518  0.16951080      0.1695107963
10 BBB 1/5/2003 0.5082909 -0.34367645     -0.3436764522
11 CCC 1/1/2003 0.8368362          NA                NA
12 CCC 1/2/2003 0.8371368  0.00035922      0.0003592196
13 CCC 1/3/2003 0.8090166 -0.03359092     -0.0335909235
14 CCC 1/4/2003 0.6902775 -0.14676969     -0.1467696849
15 CCC 1/5/2003 0.7963571  0.15367669      0.1536766860

Alternatively, you can skip the plyr bit, calculate the Delta for the entire data.frame, and then update the first row for each ID. There are lots of good ideas about selecting the first row of a data.frame based on an identifier here. Something like this would probably work:

dat$Delta <- Delt(dat$Temp)
# diff() on a factor uses its underlying integer codes, so this flags the
# first row of each ID; column 5 is the Delta column just added
dat[diff(c(0, dat$ID)) != 0, 5] <- NA

On a related note, if anyone can explain why Delt doesn't seem to accept my plea to give it a reasonable column name, I'd appreciate it.

Chase
    Interesting combo. I thought one had to have an xts structure for quantmod to work. Also, I have been wondering how to pinpoint the first/last obs of a data.frame; the link is extremely useful. – Mateo Jun 30 '11 at 10:57

Since the OP specifically asked about using by(), I thought I'd provide an answer that illustrates its use.

First you write a function that acts on each 'piece' of the data frame:

myFun <- function(x){
    n <- nrow(x)
    # diff() gives the n-1 sequential differences; head(x$Temp, n-1) gives
    # the corresponding starting values, so the ratio is the fractional change
    x$Change <- c(NA, diff(x$Temp) / head(x$Temp, n - 1))
    x
}

I've used the base function diff to calculate the sequential differences in Temp, and then, since the resulting vector has length n-1, we use head to divide the differences by all but the last Temp value. (I did this just to work head in, since it's handy; there are lots of other ways to do that.)
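To make the diff()/head() arithmetic concrete, a quick sketch on a toy vector:

```r
x <- c(10, 12, 9)
diff(x)                            # 2 -3   : sequential differences, length n-1
head(x, length(x) - 1)             # 10 12  : all but the last value
diff(x) / head(x, length(x) - 1)   # 0.20 -0.25 : fractional change per step
```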

Then the by call:

by(dat,dat$ID,FUN=myFun)

If you want to put all the pieces back together again, we can use do.call and rbind:

do.call(rbind,by(dat,dat$ID,FUN=myFun))
joran
  • +1, nice answer. Every time I try and use the aggregating functions in base R, I end up feeling like I'm playing with a dull, rusty spork. – Chase Jun 28 '11 at 18:58
  • Thanks for the answer. I didn't think about the diff function....just getting used to do.call()! – Mateo Jun 30 '11 at 10:59

Your suggested output is not a "%change" (it is rather a fractional difference), while this illustrates a method for getting "percent of original", using the initial value as the basis for the change:

> dat$pctTemp <-  unlist(
            tapply(dat$Temp, dat$ID, function(x) c(NA, 100*x[-1]/x[1]) )
                        )
> dat
    ID     Date      Temp   pctTemp
1  AAA 1/1/2003 0.7498817        NA
2  AAA 1/2/2003 0.6666616  88.90223
3  AAA 1/3/2003 0.7730799 103.09358
4  AAA 1/4/2003 0.6290236  83.88305
5  AAA 1/5/2003 0.7333124  97.79041
6  BBB 1/1/2003 0.7073398        NA
7  BBB 1/2/2003 0.7649865 108.14980
8  BBB 1/3/2003 0.6622015  93.61858
snipped

If you want the interval change, you can divide diff(x) by the preceding values:

> dat$pctTemp <-  unlist(tapply(dat$Temp, dat$ID, function(x) c(NA, 100*diff(x)/x[-length(x)]) )  )
> dat
    ID     Date      Temp      pctTemp
1  AAA 1/1/2003 0.7498817           NA
2  AAA 1/2/2003 0.6666616 -11.09776868
3  AAA 1/3/2003 0.7730799  15.96287574
4  AAA 1/4/2003 0.6290236 -18.63407501
5  AAA 1/5/2003 0.7333124  16.57946178
6  BBB 1/1/2003 0.7073398           NA
7  BBB 1/2/2003 0.7649865   8.14979813
snipped
IRTFM