
I have a long-format data frame containing meteorological data from a mast. It contains observations (data$value) of different parameters (wind speed, direction, air temperature, etc., in data$param) taken at the same times and at different heights (data$z).

I am trying to efficiently slice this data by $time and then apply functions to all of the data collected at each time. Usually a function is applied to a single $param at a time (e.g. I apply different functions to wind speed than I do to air temperature).

Current approach

My current method is to use data.frame and ddply.

If I want to get all of the wind speed data, I run this:

# find good data ----
df <- data[((data$param == "wind speed") &
                  !is.na(data$value)),]

I then run my function on df using ddply() from plyr:

library(plyr)

df.tav <- ddply(df,
                .(time),
                function(x) {
                  y <- data.frame(V1 = sum(x$value) + sum(x$z),
                                  V2 = sum(x$value) / sum(x$z))
                  return(y)
                })

Usually V1 and V2 are calls to other functions. These are just examples. I do need to run multiple functions on the same data though.
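For illustration only, here is a sketch of what that looks like with the inline sums replaced by calls to two made-up per-parameter functions (mean_speed() and peak_speed() are placeholder names, not the functions I actually use):

# placeholder stand-ins for the real per-parameter functions
mean_speed <- function(value, z) mean(value)
peak_speed <- function(value, z) max(value)

df.tav <- ddply(df,
                .(time),
                function(x) {
                  data.frame(V1 = mean_speed(x$value, x$z),
                             V2 = peak_speed(x$value, x$z))
                })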

Question

My current approach is very slow. I have not benchmarked it, but it's slow enough that I can go get a coffee and come back before a year's worth of data has been processed.

I have on the order of a hundred towers to process, each with a year of data at 10-12 heights, so I am looking for something faster.

Data sample

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")
Andy Clifton

2 Answers


Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000

If you want to apply a different function for each param, here's a more uniform approach for that.

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                 p     fn
# 1: wind direction <call>    # the fn column contains unevaluated calls
# 2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

P.S. I think the fact that I have to use param in some way before eval for eval to work is a bug.


UPDATE: As of version 1.8.11 this bug has been fixed and the following works:

dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
eddi
  • How would this work if the functions were not `sum()`, but instead I need to pass in value and z - e.g. `myFunction(value,z)`? Do I just put that call in the `list()`? – Andy Clifton Sep 27 '13 at 15:58
  • @AndyClifton just write `myFunction(value, z)`? – eddi Sep 27 '13 at 16:02
  • The second approach is interesting but at the limits of readability (for me). I've gone with the first, using `list(V1 = myFunction1(value,z), V2 = myFunction2(value,z))`. Speed up is about a factor of 100. – Andy Clifton Sep 27 '13 at 17:08
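Putting that comment together with the first approach above, a sketch of the resulting call (myFunction1() and myFunction2() are just the placeholder names from the comment, given arbitrary bodies here) would be:

library(data.table)

# arbitrary placeholder bodies; substitute the real per-parameter functions
myFunction1 <- function(value, z) sum(value) + sum(z)
myFunction2 <- function(value, z) sum(value) / sum(z)

dt <- data.table(data)
setkey(dt, param)          # sort by param for fast lookup

dt[J("wind speed")][!is.na(value),
                    list(V1 = myFunction1(value, z),
                         V2 = myFunction2(value, z)),
                    by = time]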

Use dplyr. It's still in development, but it's much much faster than plyr:

# devtools::install_github("hadley/dplyr")
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)

summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

The other advantage of dplyr is that you can use a data table as a backend, without having to know anything about data.table's special syntax:

library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))

(dplyr with a data frame is 20-100x faster than plyr, and dplyr with a data.table is about another 10x faster). dplyr is nowhere near as concise as data.table, but it has a function for each major task of data analysis, which I find makes the code easier to understand - you should almost be able to read a sequence of dplyr operations to someone else and have them understand what's going on.
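To illustrate that point, the wind-speed summary above can be written as an explicit sequence of dplyr verbs (filter(), group_by() and summarise() are standard dplyr functions; the step-by-step layout is just one way of arranging it):

library(dplyr)

# filter to the rows of interest, group them by time, then summarise each group
good    <- filter(data, param == "wind speed", !is.na(value))
by_time <- group_by(good, time)
summarise(by_time,
          V1 = sum(value) + sum(z),
          V2 = sum(value) / sum(z))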

If you want to do different summaries per variable, I recommend changing your data structure to be "tidy":

library(reshape2)
data_tidy <- dcast(data, ... ~ param)

daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy, 
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)
hadley
  • Hi @hadley. What _are_ the 11+ functions you tweeted about, and the major tasks exactly that you map to, mentioned here? I looked on the dplyr homepage and saw 5: select(), filter(), mutate(), summarise() and arrange(). When I look at `group_by(data.table(windspeed), time)` I find it confusing because there's no aggregate there. Does `group_by` split the data? Or does it set the key maybe? – Matt Dowle Sep 28 '13 at 15:12
  • Btw, to read out loud a `data.table` query you say "from `DT` subset `i` then do `j` grouped by `by`". It's really only 3 simple arguments: DT[`i`,`j`,`by`]. It might be easier for SQL folk to click with. – Matt Dowle Sep 28 '13 at 15:23
  • @MatthewDowle `group_by` is conceptually similar to setting the key - you're describing that you want do the analysis "by" a group (it's translated to `GROUP BY` with an sql backend). The other verbs are group by, and the (currently) four joins. – hadley Sep 28 '13 at 18:49
  • @MatthewDowle I find it hard to apply your reading suggestion to the examples in the other answer, esp. with the eval etc. I also find it hard to determine the correct data.table syntax because actions behave differently inside and outside of `[`. – hadley Sep 28 '13 at 18:50
  • You're not being fair here. That part of @eddi's answer is complicated because he's keeping the data wide and demonstrating stretching data.table; e.g., he created a data.table that contains functions and looks them up for goodness sake. You're not comparing like with like. Btw, don't you need to deal with NA in the `subset` or add `na.rm` to each of the `sum` calls? – Matt Dowle Sep 28 '13 at 20:16
  • Could you take a look at the benchmark Thell updated [here](http://stackoverflow.com/a/11921237/403310) please. Is that right? If it is, why is dplyr using data.table back end slower than using data.table directly? Something seems amiss there to me. – Matt Dowle Sep 28 '13 at 20:26
  • @MatthewDowle I'll take a look at the benchmarks, but I'm obviously not the best data.table programmer in the world so patches are welcome (e.g. see https://github.com/hadley/dplyr/blob/master/R/manip-grouped-dt.r) – hadley Sep 28 '13 at 20:33
  • @MatthewDowle plus there will be some overhead to using a wrapper, some of which could be reduced by taking a lazy approach, like the sqlite backend does. – hadley Sep 28 '13 at 20:36
  • Sure but the wrapper only runs through once doesn't it, so any overhead would be minimal (not like it's being looped or anything) ... oh ... there we go ... benchmark's replications=100 rears its ugly head yet again. – Matt Dowle Sep 28 '13 at 20:43
  • I had a brief look at `manip-grouped-dt.r`. There's 170 lines there. When you said about this layer stuff I imagined `dplyr` would just manipulate the syntax and then dispatch the `data.table` query. But there's all kinds of stuff I don't understand going on as well in there. Will the optimization of combined `DT[i,j,by]` happen when `subset()` is separated from the `summarise()` for example? That's one reason the `DT[i,j,by]` syntax is the way it is inside one `DT[...]` call; i.e., the `[.data.table` internals see `i` and `j` and `by` together as a tuple before evaluation. – Matt Dowle Sep 28 '13 at 22:10
  • @MatthewDowle not currently, but it could (which is what the sql backend does). It's relatively straightforward to cache all desired operations and then execute once on demand. – hadley Sep 28 '13 at 22:49
  • Sounds complicated to me. On the SQL backend I think you'll struggle with performance there. Is `dplyr` going to support a column database as a backend too? – Matt Dowle Sep 30 '13 at 09:43
  • @MatthewDowle it's not too complicated, and I have to deal with the complication, not the user. Yes, plan to support column oriented databases. – hadley Sep 30 '13 at 12:30