Calculate difference between two observations in a single case for a number of separate cases in R

Question

I have a dataset similar to the following.

Case <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
ObservationNumber <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
ObservedValue <- c(154, 152, 157, 173, 176, 171, 203, 205, 199)

And I am trying to create a new data frame where one column is the Case (A, B, or C) and the second column is the change in observed value from the first observation to the third observation. The number of rows would be the number of distinct cases (3 in the example above, but several hundred in the real dataset).

There are workarounds that could work for the example data, but my real data is several hundred cases with 50 observations each. As such, I need some way of doing this that does not require (1) creating a separate dataset for each case (e.g. by filtering) to perform the subtraction on, which is the only way I have been able to come up with on my own.

Does anybody have any other ideas for ways to do this?

**Also in the example the rows are in order, but in the real dataset they are not. As a result, I can’t just take the first and last row in each group; I need to take rows with the maximum and minimum values in each group. Both answers below address this well.

Possible duplicate of [How to select the first and last row within a grouping variable in a data frame?](https://stackoverflow.com/questions/8203818/how-to-select-the-first-and-last-row-within-a-grouping-variable-in-a-data-frame) — lebelinoz, Sep 16 '17 at 20:42
I didn’t find that, but yes they are similar. I will say though that in my case it was not the first and last row, but the row with the minimum and maximum values. — bbernicker, Sep 18 '17 at 00:00

lebelinoz · Accepted Answer · 2017-09-16T20:39:13.020

4

I like to use the dplyr library for these kinds of grouping/summarising questions:

library(dplyr)
df %>% 
    arrange(ObservationNumber) %>%
    group_by(Case) %>% 
    summarise(diff = last(ObservedValue) - first(ObservedValue))

edited Sep 16 '17 at 20:39

answered Sep 16 '17 at 20:30

lebelinoz

4,890
10
33
56

`arrange` is not necessary in this case. – www Sep 16 '17 at 20:31
1

@ycw Don't I need it just in case the data isn't ordered by `ObservationNumber`? – lebelinoz Sep 16 '17 at 20:34
And then I guess it is safer to add `arrange` there. Good! – www Sep 16 '17 at 20:35
Thank you! It didn't occur to me that dplyr would be the right package for this, but this gets the job done and is very simple! – bbernicker Sep 16 '17 at 20:42

Hack-R · Answer 2 · 2017-09-16T20:30:18.193

Case <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
ObservationNumber <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
ObservedValue <- c(154, 152, 157, 173, 176, 171, 203, 205, 199)

mydat <- as.data.frame(cbind(Case, ObservationNumber, ObservedValue),stringsAsFactors = F)
mydat$ObservedValue <- as.numeric(mydat$ObservedValue)

result <- data.frame()
for(c in unique(mydat$Case)){
  result <- rbind(result, data.frame(Case=c,Diff=  
                    mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == min(mydat$ObservationNumber)] - 
                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == max(mydat$ObservationNumber)]
  ))
}

result

  Case Diff
1    A  -49
2    B  -53
3    C  -42

I also made an example with a bigger data set to make sure it was still fast, which it was:

Case              <- c(rep("A",50), rep("B",50),rep("C",50))
ObservationNumber <- rep(seq(1,50),3)
ObservedValue     <- runif(150)

mydat <- as.data.frame(cbind(Case, ObservationNumber, ObservedValue),stringsAsFactors = F)
mydat$ObservedValue <- as.numeric(mydat$ObservedValue)

result <- data.frame()
for(c in unique(mydat$Case)){
  result <- rbind(result, data.frame(Case=c,Diff=  
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == min(mydat$ObservationNumber)] - 
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == max(mydat$ObservationNumber)]
  ))
}

... and just to be 100% sure, I scaled it up again to 4 cases with 500 observations each (2000 rows) and it still finished in less than 1 second:

Case              <- c(rep("A",500), rep("B",500),rep("C",500), rep("D",500))
ObservationNumber <- rep(seq(1,500),4)
ObservedValue     <- runif(2000)

mydat <- as.data.frame(cbind(Case, ObservationNumber, ObservedValue),stringsAsFactors = F)
mydat$ObservedValue <- as.numeric(mydat$ObservedValue)

result <- data.frame()
for(c in unique(mydat$Case)){
  result <- rbind(result, data.frame(Case=c,Diff=  
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == min(mydat$ObservationNumber)] - 
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == max(mydat$ObservationNumber)]
  ))
}

Calculate difference between two observations in a single case for a number of separate cases in R

2 Answers2