Suppose I have a data frame with two variables on which I'm trying to run some basic summary statistics. I would like to run a loop that gives me the difference between the minimum and maximum seconds values for each unique value of number. My actual data frame is huge and contains many values of 'number', so subsetting and running each one individually is not a realistic option. The data looks like this:

df <- data.frame(number=c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
                 seconds=c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109)) 
     number seconds
1       1       1
2       1       4
3       1       8
4       2       1
5       2       5
6       2      11
7       2      23
8       3       1
9       3       8
10      4       1
11      4       9
12      4      11
13      4      24
14      4      44
15      4     112
16      5       1
17      5      34
18      5      55
19      5     109

My current code only returns the difference between the minimum and maximum seconds for the entire data frame:

ZZ <- unique(df$number)
for (i in ZZ){
      Y <- max(df$seconds) - min(df$seconds) 
}
Frank
Jojo
  • Why do you need a loop? Aggregate might work better here. Or any of the 'do something by something' libraries like dplyr or data.table. – Heroka Nov 04 '15 at 15:15
  • Thank you @Heroka. Although the code below does exactly what I want it to, this thread should prove useful. – Jojo Nov 04 '15 at 15:22
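The `aggregate` suggestion in the comment above could look roughly like the following in base R. This is only a minimal sketch using the example data from the question; the result name `spread` is chosen here for illustration and does not come from the original post.

```r
# Example data from the question
df <- data.frame(number = c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
                 seconds = c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))

# For each value of number, compute max(seconds) - min(seconds)
spread <- aggregate(seconds ~ number, data = df,
                    FUN = function(x) max(x) - min(x))

# aggregate() keeps the name "seconds" for the result column; rename it
# (the name "spread" is an illustrative choice)
names(spread)[2] <- "spread"
spread
```

No explicit loop is needed because `aggregate` splits `seconds` by `number` and applies the function to each group.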

1 Answer

Since you have a lot of data, performance should matter, so consider using a data.table instead of a data.frame:

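library(data.table)
dt <- as.data.table(df)  # convert the data.frame to a data.table
# spread = max - min of seconds, computed separately for each value of number
dt[, .(spread = max(seconds) - min(seconds)), by = .(number)]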

   number spread
1:      1      7
2:      2     22
3:      3      7
4:      4    111
5:      5    108
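For reference, the loop in the question returns the overall range because `max(df$seconds)` and `min(df$seconds)` never use `i`, and `Y` is overwritten on every iteration. If a loop is still wanted, a corrected version might look like the sketch below; the names `groups`, `spread` and `result` are illustrative and not from the original code.

```r
# Example data from the question
df <- data.frame(number = c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
                 seconds = c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))

groups <- unique(df$number)
spread <- numeric(length(groups))  # preallocate one result slot per group

for (k in seq_along(groups)) {
  # keep only the seconds values that belong to the current group
  s <- df$seconds[df$number == groups[k]]
  spread[k] <- max(s) - min(s)
}

result <- data.frame(number = groups, spread = spread)
result
```

On a large data frame this rescans the `number` column once per group, so the grouped data.table call above is likely to be considerably faster.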
R Yoda