2

I have a dataset similar to the following.

Case <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
ObservationNumber <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
ObservedValue <- c(154, 152, 157, 173, 176, 171, 203, 205, 199)

And I am trying to create a new data frame where one column is the Case (A, B, or C) and the second column is the change in observed value from the first observation to the third observation. The number of rows would be the number of distinct cases (3 in the example above, but several hundred in the real dataset).

There are workarounds that could work for the example data, but my real data is several hundred cases with 50 observations each. As such, I need some way of doing this that does not require (1) creating a separate dataset for each case (e.g. by filtering) to perform the subtraction on, which is the only way I have been able to come up with on my own.

Does anybody have any other ideas for ways to do this?

**Also in the example the rows are in order, but in the real dataset they are not. As a result, I can’t just take the first and last row in each group; I need to take rows with the maximum and minimum values in each group. Both answers below address this well.

bbernicker
  • 158
  • 1
  • 14
  • Possible duplicate of [How to select the first and last row within a grouping variable in a data frame?](https://stackoverflow.com/questions/8203818/how-to-select-the-first-and-last-row-within-a-grouping-variable-in-a-data-frame) – lebelinoz Sep 16 '17 at 20:42
  • I didn’t find that, but yes they are similar. I will say though that in my case it was not the first and last row, but the row with the minimum and maximum values. – bbernicker Sep 18 '17 at 00:00

2 Answers2

4

I like to use the dplyr library for these kinds of grouping/summarising questions:

library(dplyr)
df %>% 
    arrange(ObservationNumber) %>%
    group_by(Case) %>% 
    summarise(diff = last(ObservedValue) - first(ObservedValue))
lebelinoz
  • 4,890
  • 10
  • 33
  • 56
2
Case <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
ObservationNumber <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
ObservedValue <- c(154, 152, 157, 173, 176, 171, 203, 205, 199)

mydat <- as.data.frame(cbind(Case, ObservationNumber, ObservedValue),stringsAsFactors = F)
mydat$ObservedValue <- as.numeric(mydat$ObservedValue)

result <- data.frame()
for(c in unique(mydat$Case)){
  result <- rbind(result, data.frame(Case=c,Diff=  
                    mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == min(mydat$ObservationNumber)] - 
                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == max(mydat$ObservationNumber)]
  ))
}

result
  Case Diff
1    A  -49
2    B  -53
3    C  -42

I also made an example with a bigger data set to make sure it was still fast, which it was:

Case              <- c(rep("A",50), rep("B",50),rep("C",50))
ObservationNumber <- rep(seq(1,50),3)
ObservedValue     <- runif(150)

mydat <- as.data.frame(cbind(Case, ObservationNumber, ObservedValue),stringsAsFactors = F)
mydat$ObservedValue <- as.numeric(mydat$ObservedValue)

result <- data.frame()
for(c in unique(mydat$Case)){
  result <- rbind(result, data.frame(Case=c,Diff=  
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == min(mydat$ObservationNumber)] - 
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == max(mydat$ObservationNumber)]
  ))
}

... and just to be 100% sure, I scaled it up again to 4 cases with 500 observations each (2000 rows) and it still finished in less than 1 second:

Case              <- c(rep("A",500), rep("B",500),rep("C",500), rep("D",500))
ObservationNumber <- rep(seq(1,500),4)
ObservedValue     <- runif(2000)

mydat <- as.data.frame(cbind(Case, ObservationNumber, ObservedValue),stringsAsFactors = F)
mydat$ObservedValue <- as.numeric(mydat$ObservedValue)

result <- data.frame()
for(c in unique(mydat$Case)){
  result <- rbind(result, data.frame(Case=c,Diff=  
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == min(mydat$ObservationNumber)] - 
                                       mydat$ObservedValue[mydat$Case==c & mydat$ObservationNumber == max(mydat$ObservationNumber)]
  ))
}
Hack-R
  • 22,422
  • 14
  • 75
  • 131