I am working with monthly climate data from several weather stations across the Greater Albuquerque Area. I have taken this subset of the airport data as an example; I will eventually apply the same process to all locations. There are close to 500 months of data available, but I have included only the first 30 here.

> head(ABQ, 30)
                                STATION_NAME       DATE CLDD
9698 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1945-05-01  449
9699 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1945-06-01 1335
9700 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1945-07-01 2330
9701 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1945-08-01 2269
9702 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1945-09-01 1247
9703 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1945-10-01   13
9709 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-04-01   62
9710 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-05-01  251
9711 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-06-01 2097
9712 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-07-01 2303
9713 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-08-01 1889
9714 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-09-01 1111
9715 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1946-10-01   23
9721 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-04-01    1
9722 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-05-01  611
9723 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-06-01 1273
9724 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-07-01 2636
9725 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-08-01 1892
9726 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-09-01 1265
9727 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1947-10-01  171
9733 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-04-01   91
9734 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-05-01  642
9735 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-06-01 1506
9736 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-07-01 2529
9737 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-08-01 2186
9738 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-09-01 1130
9739 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1948-10-01   13
9745 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1949-04-01   88
9746 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1949-05-01  304
9747 ALBUQUERQUE INTERNATIONAL AIRPORT NM US 1949-06-01 1477

I would like to compute the yearly sum of ABQ$CLDD and plot that value with ggplot(), something like this:

    CLDD_yr <- apply.yearly(ABQ$DATE, sum(CLDD))
    p <- ggplot(CLDD_yr, aes(YEAR, CLDD_yr)),
         + stat_smooth(method = "lm", formula = y~x + I(x^2), size = 1)

I know I am making a mistake somewhere, I think in how I call the data, but I cannot seem to sort it out.

The DATE column is POSIX time, as seen here:

> class(ABQ$DATE)
[1] "POSIXlt" "POSIXt" 

EDIT: per coffeinjunky's comments

Perhaps a new df would be the best way to approach this, as I will need to look at data for more than one location in the same manner.

> stations
      unique(Bernalillo_data$STATION_NAME)
1  ALBUQUERQUE INTERNATIONAL AIRPORT NM US
2            PETROGLYPH NATIONAL MON NM US
3                        SANDIA PARK NM US
4                    ALBUQUERQUE VLY NM US
5           ALBUQUERQUE FOOTHILLS NE NM US
6              SANDIA RANGER STATION NM US
7                       SANDIA CREST NM US
8                 LA MADERA SKI AREA NM US
9                    NETHERWOOD PARK NM US
10                   EXPERIMENT FARM NM US
11                      KIRTLAND AFB NM US

Maybe the new df should be something like

header <-  station_name    Year    CLDD_sum

This would make the analysis simpler in the long run, I think.

c0ba1t
  • Why the reluctance to create a dataframe summarizing the values? One way or another, some aggregation will have to occur at some point. – coffeinjunky Mar 29 '16 at 15:16
  • @coffeinjunky, I would like to be able to keep referencing the same data throughout my script for readability. it will be seen by other people... I am not completely opposed to it I guess, I just want the code to do the work so to speak – c0ba1t Mar 29 '16 at 15:33
  • Would creating new columns be an option? – coffeinjunky Mar 29 '16 at 15:35
  • Sure, but this would lead to a 'melt' scenario right? I would probably be better off just making a new df... perhaps you can pose that answer.. I think it would still be relevant, how do I sum yearly values as a call from YYYY-MM-DD posix times? – c0ba1t Mar 29 '16 at 15:37
  • `new.df <- aggregate(data = ABQ, CLDD ~ DATE$year, sum)` – JasonAizkalns Mar 29 '16 at 15:46
  • @JasonAizkalns, Thanks. that is nice... what would the analog be for a df that had multiple stations? – c0ba1t Mar 29 '16 at 15:52

2 Answers

Try this,

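library(data.table)
library(ggplot2)

# data.table does not support POSIXlt columns, so convert DATE first.
ABQ$DATE <- as.POSIXct(ABQ$DATE)
setDT(ABQ)

# Add the year and the yearly sum to every row (by reference).
ABQ[, YEAR := year(DATE)]
ABQ[, CLDD_yr := sum(CLDD), by = YEAR]

# Convert back to a plain data.frame for ggplot.
setDF(ABQ)

p <- ggplot(ABQ, aes(YEAR, CLDD_yr)) +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)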

Note that you will have to install data.table. This creates the summary statistic for every row, so you might get several overlapping dots in ggplot. If you do not want that, you could try:

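library(data.table)
library(ggplot2)

# data.table does not support POSIXlt columns, so convert DATE first.
ABQ$DATE <- as.POSIXct(ABQ$DATE)
setDT(ABQ)

# One row per year; inside .() use = rather than :=.
for_plot <- ABQ[, .(CLDD_yr = sum(CLDD)), by = .(year = year(DATE))]

# Convert to a plain data.frame for ggplot.
setDF(for_plot)

p <- ggplot(for_plot, aes(year, CLDD_yr)) +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)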
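
If installing data.table is not an option, a one-row-per-year table shaped like for_plot can probably be built in base R as well. This is only a minimal sketch: ABQ_demo and for_plot_base are made-up names, the four rows are copied from the question, and it assumes DATE is POSIXlt, whose $year field counts years since 1900.

```r
# Hypothetical stand-in: four rows copied from the question's ABQ data.
dates <- as.POSIXlt(c("1945-05-01", "1945-06-01", "1946-04-01", "1946-05-01"),
                    tz = "UTC")
ABQ_demo <- data.frame(
  year = dates$year + 1900,  # POSIXlt stores years since 1900
  CLDD = c(449, 1335, 62, 251)
)

# One row per year, with CLDD summed within each year.
for_plot_base <- aggregate(CLDD ~ year, data = ABQ_demo, FUN = sum)
names(for_plot_base)[2] <- "CLDD_yr"
```

The resulting data frame has the columns year and CLDD_yr, so it should drop into the same ggplot() call.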

Hope this helps.

Stereo
  • this is a good solution.. I am trying to minimize new packages but this does work. thank you – c0ba1t Mar 29 '16 at 16:17
  • For convenience you will probably have to use `data.table` or `dplyr`. I prefer the former for performance reasons and it works well with `xts` objects. – Stereo Mar 29 '16 at 16:22
I think there are many approaches you could use, but some aggregation will have to occur at some point. Here are two suggestions:

library(dplyr)
library(ggplot2)
# df stands for your data frame, e.g. df <- ABQ
df$year <- df$DATE$year + 1900 # POSIXlt counts years from 1900
df$DATE <- as.POSIXct(df$DATE) # dplyr doesn't play well with POSIXlt
df_yr <- df %>% group_by(year) %>% summarise(cldd_yr = sum(CLDD))

This yields:

Source: local data frame [5 x 2]

   year cldd_yr
  (chr)   (int)
1  1945    7643
2  1946    7736
3  1947    7849
4  1948    8097
5  1949    1869

which you can use in combination with ggplot. For multiple stations, just add the station as a grouping variable. For instance, df_yr <- df %>% group_by(year, station) %>% summarise(cldd_yr = sum(CLDD)) will give you the summary for all years and stations, provided that station is the name of your identifier column (STATION_NAME in the question's data).
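
For anyone who wants to avoid extra packages, the multi-station grouping can also be sketched in base R with aggregate() and two grouping variables. Treat this as an illustration only: demo and station_year are made-up names, the airport rows come from the question, and the SANDIA PARK values are invented purely so that there is a second station to group by.

```r
# Hypothetical two-station data; the SANDIA PARK numbers are invented.
demo <- data.frame(
  STATION_NAME = rep(c("ALBUQUERQUE INTERNATIONAL AIRPORT NM US",
                       "SANDIA PARK NM US"), each = 4),
  DATE = rep(c("1945-05-01", "1945-06-01", "1946-04-01", "1946-05-01"),
             times = 2),
  CLDD = c(449, 1335, 62, 251, 10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Pull the calendar year out of the date (POSIXlt counts from 1900).
demo$year <- as.POSIXlt(demo$DATE, tz = "UTC")$year + 1900

# One row per station and year, with CLDD summed within each group.
station_year <- aggregate(CLDD ~ STATION_NAME + year, data = demo, FUN = sum)
```

Each station then contributes one row per year, which is the station_name / Year / CLDD_sum layout sketched in the question.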

If you really don't want to use a new dataframe but are ok with adding a column, try

 df <- group_by(df, year) %>% mutate(yr.sum = sum(CLDD))

In yr.sum, you have the yearly sum. Note that this value repeats on every row of a given year, and you will have to make sure that ggplot uses it correctly. I would propose to use the first approach though, as it is probably more efficient and more transparent.
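
To make the repetition concrete, here is a rough base R analog of the added-column approach, using ave() in place of mutate() (that substitution is mine, not part of the dplyr code above). The names demo and per_year are made up and the four rows are copied from the question. Collapsing the duplicates with unique() before plotting is one possible way to keep ggplot from drawing the same point several times.

```r
# Hypothetical stand-in with the year already extracted.
demo <- data.frame(
  year = c(1945, 1945, 1946, 1946),
  CLDD = c(449, 1335, 62, 251)
)

# Add the yearly sum as a column; it repeats on every row of a year.
demo$yr.sum <- ave(demo$CLDD, demo$year, FUN = sum)

# Keep a single row per year before handing the data to ggplot.
per_year <- unique(demo[, c("year", "yr.sum")])
```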

coffeinjunky
  • sure thing.. but i have a question regarding your df$DATE$year... this part is giving me an error 'object of type closure is not subsettable' I can see how your answer will work out to what I need so I selected it, and I am sure I can get there but I was not expecting this – c0ba1t Mar 29 '16 at 16:09
  • one note though.. the > df_yr <- df %>% group_by(year, station) %>% summarise(CLDD_yr = sum(CLDD)) Error: column 'DATE' has unsupported type : POSIXlt, POSIXt – c0ba1t Mar 29 '16 at 16:12
  • > library(dplyr) > library(ggplot2) > df <- Bernalillo_data > df$year <- df$DATE$year > df_yr <- df %>% group_by(year,STATION_NAME) %>% summarise(CLDD_yr = sum(CLDD)) Error: column 'DATE' has unsupported type : POSIXlt, POSIXt – c0ba1t Mar 29 '16 at 16:20
  • See http://stackoverflow.com/questions/27828850/dplyr-does-not-group-data-by-date for an explanation. – coffeinjunky Mar 29 '16 at 16:29