2

In reality I have a very large data frame. One column contains an ID and another contains a value associated with that ID. However, each ID occurs multiple times with differing values, and I wish to record the maximum value for each ID while discarding the rest. Here is a replicable example using the quakes dataset in R:

data <- as.data.frame(quakes)

##Create output matrix
output <- matrix(,length(unique(data[,5])),2)
colnames(output) <- c("Station ID", "Max Mag")

##Grab unique station IDs
uni <- unique(data[,5])

##Go through each station ID and record the maximum magnitude
for (i in 1:dim(output)[1])
{
  sub.data <- data[which(data[,5]==uni[i]),]
  ##Put station ID in column 1
  output[i,1] <- uni[i]
  ##Put biggest magnitude in column 2
  output[i,2] <- max(sub.data[,4])
}

Considering that with my real data I have data frames with dimensions of 100000's of rows, this is a slow process. Is there a quicker way to execute such a task?

Any help much appreciated!

JPD
  • 2,561
  • 5
  • 22
  • 26

3 Answers3

2
library(plyr)
ddply(data, "stations", function(data){data[which.max(data$mag),]})
       lat   long depth mag stations
1   -27.21 182.43    55 4.6       10
2   -27.60 182.40    61 4.6       11
3   -16.24 168.02    53 4.7       12
4   -27.38 181.70    80 4.8       13
-----

You can also use:

> data2 <- data[order(data$mag,decreasing=T),]
> data2[!duplicated(data2$stations),]
        lat   long depth mag stations
152  -15.56 167.62   127 6.4      122
15   -20.70 169.92   139 6.1       94
17   -13.64 165.96    50 6.0       83
870  -12.23 167.02   242 6.0      132
1000 -21.59 170.56   165 6.0      119
558  -22.91 183.95    64 5.9      118
109  -22.55 185.90    42 5.7       76
151  -23.34 184.50    56 5.7      106
176  -32.22 180.20   216 5.7       90
275  -22.13 180.38   577 5.7      104

Also :

> library(data.table)
> data <- data.table(data)
> data[,.SD[which.max(mag)],by=stations]
     stations    lat   long depth mag
  1:       41 -23.46 180.11   539 5.0
  2:       15 -13.40 166.90   228 4.8
  3:       43 -26.00 184.10    42 5.4
  4:       19 -19.70 186.20    47 4.8
  5:       11 -27.60 182.40    61 4.6
 ---                                 
 98:       77 -21.19 181.58   490 5.0
 99:      132 -12.23 167.02   242 6.0
100:      115 -17.85 181.44   589 5.6
101:      121 -20.25 184.75   107 5.6
102:      110 -19.33 186.16    44 5.4  

data.table works better for large dataset

Prasanna Nandakumar
  • 4,295
  • 34
  • 63
0

You could try tapply, too:

tapply(data$mag, data$stations, FUN=max)
sgibb
  • 25,396
  • 3
  • 68
  • 74
0

You can try the new 'dplyr' package as well, which is much faster and easier to use than 'plyr'. Using what Hadley called "like a grammar of data manipulation" by chaining the operations together with %.%, like so :

library(dplyr)
df <- as.data.frame(quakes)

df %.%
  group_by(stations) %.%
  summarise(Max = max(mag)) %.%
  arrange(desc(Max)) %.%
  head(5)

Source: local data frame [5 x 2]

  stations Max
1      122 6.4
2       94 6.1
3      132 6.0
4      119 6.0
5       83 6.0
hvollmeier
  • 2,956
  • 1
  • 12
  • 17