9

I don't understand why I can't find a solution for this, since I feel that this is a pretty basic question. Need to ask for help, then. I want to rearrange airquality dataset by month with maximum temp value for each month. In addition I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?

I have tried following without a success:

require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))

dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)

I am after something like this:

month day temp
5     7    89
...
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Mikko
  • 7,530
  • 8
  • 55
  • 92

4 Answers4

8

There was quite a discussion a while back about whether being lazy is good or not. Anwyay, this is short and natural to write and read (and is fast for large data so you don't need to change or optimize it later) :

require(data.table)
DT=as.data.table(airquality)

DT[,.SD[which.max(Temp)],by=Month]

     Month Ozone Solar.R Wind Temp Day
[1,]     5    45     252 14.9   81  29
[2,]     6    NA     259 10.9   93  11
[3,]     7    97     267  6.3   92   8
[4,]     8    76     203  9.7   97  28
[5,]     9    73     183  2.8   93   3

.SD is the subset of the data for each group, and you just want the row from it with the largest Temp, iiuc. If you need the row number then that can be added.

Or to get all the rows where the max is tied :

DT[,.SD[Temp==max(Temp)],by=Month]

     Month Ozone Solar.R Wind Temp Day
[1,]     5    45     252 14.9   81  29
[2,]     6    NA     259 10.9   93  11
[3,]     7    97     267  6.3   92   8
[4,]     7    97     272  5.7   92   9
[5,]     8    76     203  9.7   97  28
[6,]     9    73     183  2.8   93   3
[7,]     9    91     189  4.6   93   4
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Thanks guys! I haven't used data.table package previously, so it's about a time. "Answer accepted" because this one was the most complete (and slightly shorter than Chritoph_J's). All of you would have deserved it, though (+1 for everyone). – Mikko May 22 '12 at 16:17
  • I wonder if there is a more concise solution in any language. I'm guessing not. – Ben Rollert Apr 15 '14 at 14:52
5

Another approach with plyr

require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')

library(plyr)

ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)

Gives

  month day temp
1     5  29   81
2     6  11   93
3     7   8   92
4     7   9   92
5     8  28   97
6     9   3   93
7     9   4   93

Or, even simpler

require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset, 
  subset = temp == max(temp), select = c(month, day, temp) )
mnel
  • 113,303
  • 27
  • 265
  • 254
2

how about with plyr?

max.func <- function(df) {
   max.temp <- max(df$temp)

   return(data.frame(day = df$Day[df$Temp==max.temp],
                     temp = max.temp))
}

ddply(airquality, .(Month), max.func)

As you can see, the max temperature for the month happens on more than one day. If you want different behavior, the function is easy enough to adjust.

Justin
  • 42,475
  • 9
  • 93
  • 111
2

Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):

library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]

If you want to know what the .SD stands for, have a look here: SO

Community
  • 1
  • 1
Christoph_J
  • 6,804
  • 8
  • 44
  • 58