0

I have some observed data by hour. I am trying to subset this data by the day or even week intervals. I am not sure how to proceed with this task in R.

The sample of the data is below.

date                                 obs
2011-10-24 01:00:00                  12
2011-10-24 02:00:00                  4
2011-10-24 19:00:00                  18
2011-10-24 20:00:00                  7
2011-10-24 21:00:00                  4
2011-10-24 22:00:00                  2
2011-10-25 00:00:00                  4
2011-10-25 01:00:00                  2
2011-10-25 02:00:00                  2
2011-10-25 15:00:00                  12
2011-10-25 18:00:00                  2
2011-10-25 19:00:00                  3
2011-10-25 21:00:00                  2
2011-10-25 23:00:00                  9
2011-10-26 00:00:00                  13
2011-10-26 01:00:00                  11
Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
notrockstar
  • 833
  • 3
  • 15
  • 28

2 Answers2

2

First I entered the data with the multiple spaces replaced with tabs.

dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Notice these are not the same functions
#---------------------
$`296`
                 date obs
1 2011-10-24 01:00:00  12
2 2011-10-24 02:00:00   4
3 2011-10-24 19:00:00  18
4 2011-10-24 20:00:00   7
5 2011-10-24 21:00:00   4
6 2011-10-24 22:00:00   2

$`297`
                  date obs
7  2011-10-25 00:00:00   4
8  2011-10-25 01:00:00   2
9  2011-10-25 02:00:00   2
10 2011-10-25 15:00:00  12
11 2011-10-25 18:00:00   2
12 2011-10-25 19:00:00   3
13 2011-10-25 21:00:00   2
14 2011-10-25 23:00:00   9

$`298`
                  date obs
15 2011-10-26 00:00:00  13
16 2011-10-26 01:00:00  11

The POSIXlt class does not work well inside dataframes but it can ve very handy for creating time based groups. It's a list structure with these indices: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdt'. The cut.POSIXt function adds divisions at other natural boundaries; E.g.

?cut.POSIXt
  split(dat , cut(dat$date, "week") )

If you wanted to sum within date:

tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298 
 47  36  24 
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Thank you, @DWin. How would you do it by weekly? – notrockstar Jul 06 '12 at 20:39
  • Any solutions for data frames? – notrockstar Jul 06 '12 at 20:40
  • To the first question use cut(dat$date, breaks="week"). To the second question .... huh? .... that is a dataframe. – IRTFM Jul 06 '12 at 20:44
  • my second question was regarding the sentence about POSIXit not working well inside dataframes. – notrockstar Jul 06 '12 at 20:47
  • I was only advising not to use `POSIXlt` as the class for the contents of dataframes. You can certainly use `POSIXct` to `POSIXlt` as an output strategy. That's really what it is for. – IRTFM Jul 06 '12 at 20:50
  • Thank you. I was able to partition it within data frame as you advised. How could I extract each subset now to a separate data frame? – notrockstar Jul 06 '12 at 20:58
  • You can index lists either by numeric position or by name. The trick with indexing these by name is to use quotes: '298' rather than raw numbers. Assuming you assigned the result to 'segdat', then these calls would give the same results: segdat[['298']] and segdat[[3]] – IRTFM Jul 06 '12 at 21:40
2

I'd use a time series class such as xts

dat <- read.table(text="2011-10-24 01:00:00                  12
2011-10-24 02:00:00                  4
2011-10-24 19:00:00                  18
2011-10-24 20:00:00                  7
2011-10-24 21:00:00                  4
2011-10-24 22:00:00                  2
2011-10-25 00:00:00                  4
2011-10-25 01:00:00                  2
2011-10-25 02:00:00                  2
2011-10-25 15:00:00                  12
2011-10-25 18:00:00                  2
2011-10-25 19:00:00                  3
2011-10-25 21:00:00                  2
2011-10-25 23:00:00                  9
2011-10-26 00:00:00                  13
2011-10-26 01:00:00                  11", header=FALSE, stringsAsFactors=FALSE)

xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))

xts subsetting is very intuitive. For all data on "2011-10-25", do this

xobj["2011-10-25"]
#                    [,1]
#2011-10-25 00:00:00    4
#2011-10-25 01:00:00    2
#2011-10-25 02:00:00    2
#2011-10-25 15:00:00   12
#2011-10-25 18:00:00    2
#2011-10-25 19:00:00    3
#2011-10-25 21:00:00    2
#2011-10-25 23:00:00    9

You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)

xobj["2011-10-24/2011-10-25"]

Or, if you want all data from October 2011,

xobj["2011-10"]

If you want to get all data from any day that is between 19:00 and 20:00,

xobj['T19:00:00/T20:00:00']
#                    [,1]
#2011-10-24 19:00:00   18
#2011-10-24 20:00:00    7
#2011-10-25 19:00:00    3

You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)

endpoints(xobj, "days")
[1]  0  6 14 16    

Or you can convert to a lower frequency

to.weekly(xobj)
#           xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26        12        18        2         11
to.daily(xobj)
#           xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25        12        18        2          2
#2011-10-26         4        12        2          9
#2011-10-26        13        13       11         11

Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE

to.daily(xobj, OHLC=FALSE)
#           [,1]
#2011-10-25    2
#2011-10-26    9
#2011-10-26   11

For more basic subsetting, and much more, visit http://www.quantmod.com/examples/

As @JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.

You can split by day (or week, or month, etc), apply a function, then recombine

split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
#                    [,1]
#2011-10-24 01:00:00   12
#2011-10-24 02:00:00    4
#2011-10-24 19:00:00   18
#2011-10-24 20:00:00    7
#2011-10-24 21:00:00    4
#2011-10-24 22:00:00    2
#
#[[2]]
#                    [,1]
#2011-10-25 00:00:00    4
#2011-10-25 01:00:00    2
#2011-10-25 02:00:00    2
#2011-10-25 15:00:00   12
#2011-10-25 18:00:00    2
#2011-10-25 19:00:00    3
#2011-10-25 21:00:00    2
#2011-10-25 23:00:00    9
#
#[[3]]
#                    [,1]
#2011-10-26 00:00:00   13
#2011-10-26 01:00:00   11

Suppose you want only the first value of each day. split by day, lapply the first function and rbind back together.

do.call(rbind, lapply(split(xobj, 'days'), first))
#                    [,1]
#2011-10-24 01:00:00   12
#2011-10-25 00:00:00    4
#2011-10-26 00:00:00   13
GSee
  • 48,880
  • 13
  • 125
  • 145
  • 1
    Don't forget `split.xts`: `split(xobj, "days")`. – Joshua Ulrich Jul 06 '12 at 20:43
  • @GSee, thank you! For some reason running: `xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))` gives me an error : `order.by requires an appropriate time-based object`. Any advice? – notrockstar Jul 06 '12 at 20:55
  • The first argument to the function `xts` should be the data, and the second argument should be the time index. If you copy and paste the entire first block of code in my answer, it should work (and does for me) because the first 2 columns are date and time, and the 3rd column is the data. The vignettes in the `zoo` and `xts` packages will help you figure out how to turn your data into `xts`. Or, if you provide the `dput` of your data, I'll update my answer for your specific data. – GSee Jul 06 '12 at 21:03