I am building something based upon this helpful post. I have three related questions for the dataset df:

    machine     ISOdatetime
1   M1      2013-08-21 18:16:39
2   M1      2013-08-21 18:20:44
3   M1      2013-08-21 18:21:42
4   M1      2013-08-21 18:46:09
5   M1      2013-08-21 18:46:27
6   M1      2013-08-21 19:01:13
etc

I want to figure out how many values occur within each half-hourly period and put the counts in a new dataframe, like so:

    machine     ISOdatetime     numberobs
1   M1      2013-08-21 18:30:00     3
2   M1      2013-08-21 19:00:00     2
3   M1      2013-08-21 19:30:00     1
etc

The following code of course works nicely for neat hourly lengths:

df2 <- data.frame(table(cut(df$ISOdatetime, breaks="hour")))

The following code counts in 30 min blocks, but does not start neatly at hourly/half hourly points (it takes the starting point from the first listed time, which is 18:16:39 and designates start as 18:16:00):

df2 <-data.frame(table(cut(df$ISOdatetime, breaks = "30 mins")))

Question 1. What might be an elegant fix? Should I specify the required intervals with something like ints <- c("18:00", "18:30", "19:00" ...), or is that unnecessary?

Question 2. I think I will also run into trouble when I reach parts of the original dataframe df that have values for "M2" under df$machine because it will just count those as well. I will eventually want to plot each machine separately. Perhaps using subset for each "machine" will be a quick way to partition the data, but then I will end up with a dataframe for each "machine". Not a problem, but is there an elegant way to build "machine" into the command above?

Question 3. In the previous post, their count was presented at the "top of the hour", which is presumably the "end time" of the hourly interval. But it was not easy to check that with the small dataset they presented. In my own data, the counts seemed to be out. With breaks="hour", which interval should I expect each count to cover?

Have read and tried much over many recent hours and still stuck, help very much appreciated.

#

As requested, I have added further info.

My actual data

    unit    nightof     time        date        isodatetime             time2
1   7849    2013-08-21  18:16:39    2013-08-21  2013-08-21 18:16:39     2013-08-22 04:00:00
2   7849    2013-08-21  18:20:44    2013-08-21  2013-08-21 18:20:44     2013-08-22 04:00:00
3   7849    2013-08-21  18:21:42    2013-08-21  2013-08-21 18:21:42     2013-08-22 04:00:00
etc
406 7849    2013-08-21  04:06:10    2013-08-22  2013-08-22 04:06:10     2013-08-22 14:00:00
407 7849    2013-08-21  04:06:12    2013-08-22  2013-08-22 04:06:12     2013-08-22 14:00:00
408 7849    2013-08-21  04:06:28    2013-08-22  2013-08-22 04:06:28     2013-08-22 14:00:00

When I str()

'data.frame':       408 obs. of  6 variables:
$ unit:         int  7849 7849 7849 7849 7849 7849 7849 7849 7849 7849 ...
$ nightof:  Date, format: "2013-08-21" "2013-08-21" "2013-08-21" "2013-08-21" ...
$ time:     List of 408
..$ : chr "18:16:39"
..$ : chr "18:20:44"
.. [list output truncated]
$ date:     Date, format: "2013-08-21" "2013-08-21" "2013-08-21" "2013-08-21" ...
$ isodatetime:  POSIXlt, format: "2013-08-21 18:16:39" "2013-08-21 18:20:44" "2013-08-21 18:21:42" "2013-08-21 18:21:48" ...
$ time2:        POSIXct, format: "2013-08-22 04:00:00" "2013-08-22 04:00:00" "2013-08-22 04:00:00" "2013-08-22 04:00:00" ...

The modified code I used:

mon$time2 <- with(mon, as.POSIXct(ceiling(as.numeric(isodatetime)/(30*60)) * (30*60), origin = "1970-01-01"))
with(mon, data.frame(table(time2)))
by(mon, mon$unit, function(x){data.frame(table(x$time2))})

The output.

mon$unit:   7849
                Var1        Freq
1   2013-08-22 04:00:00     27
2   2013-08-22 04:30:00     13
3   2013-08-22 05:00:00     16
4   2013-08-22 05:30:00     5
5   2013-08-22 06:00:00     8
6   2013-08-22 06:30:00     10
7   2013-08-22 07:00:00     25
8   2013-08-22 07:30:00     22
9   2013-08-22 08:00:00     61
10  2013-08-22 08:30:00     93
11  2013-08-22 09:00:00     54
12  2013-08-22 09:30:00     42
13  2013-08-22 10:00:00     11
14  2013-08-22 10:30:00     2
15  2013-08-22 11:00:00     2
16  2013-08-22 11:30:00     3
17  2013-08-22 12:00:00     2
18  2013-08-22 13:00:00     1
19  2013-08-22 14:00:00     11
ptenax

3 Answers


You can use the lubridate package to extract the hour, minutes, etc. from a date object. If you know the minute of the event time, you know in which half hour the event took place. I sampled some data with two machines. I added a variable holding the "whole" half hour after which the event took place, then you can just count these. Hope this is what you are after, good luck.

library(lubridate)
set.seed(1234)
# Simulated events for two machines
example.dat <- data.frame(
  machine = paste("M", sample(1:2, 100, replace = TRUE), sep = ""),
  dates = sort(as.POSIXct(sample(1377100000:1377110000, 100), origin = "1960-01-01"))
  )
example.dat <- example.dat[order(example.dat$machine), ]

# Floor each time to the start of its half hour (minute 0 or 30, second 0)
halfHours <- ifelse(minute(example.dat$dates) < 30, 0, 30)
example.dat$datesHH <- example.dat$dates
minute(example.dat$datesHH) <- halfHours; second(example.dat$datesHH) <- 0

# Count events per machine and half hour
data.frame(table(example.dat[ ,c(1,3)]))
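
For comparison, the cut() route from the question can also be aligned to the clock in base R by passing explicit breaks instead of "30 mins". The following is only a sketch: the six sample times are copied from the question, while tz = "UTC" and the names to_time, first_break and last_break are assumptions made for illustration. With right = FALSE each bin is [start, end), and the labels used here are the bin starts.

```r
# Sample times from the question; tz = "UTC" is an assumption
times <- as.POSIXct(c("2013-08-21 18:16:39", "2013-08-21 18:20:44",
                      "2013-08-21 18:21:42", "2013-08-21 18:46:09",
                      "2013-08-21 18:46:27", "2013-08-21 19:01:13"),
                    tz = "UTC")
machine <- rep("M1", length(times))

half_hour <- 30 * 60  # seconds

# Hypothetical helper: epoch seconds back to a date-time in the assumed zone
to_time <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# First break at or before the earliest time, last break after the latest time
first_break <- to_time(floor(as.numeric(min(times)) / half_hour) * half_hour)
last_break  <- to_time((floor(as.numeric(max(times)) / half_hour) + 1) * half_hour)
breaks <- seq(first_break, last_break, by = "30 min")

# right = FALSE gives bins [start, end); label each bin by its start
bins <- cut(times, breaks = breaks, right = FALSE,
            labels = format(breaks[-length(breaks)], "%Y-%m-%d %H:%M", tz = "UTC"))

# Cross-tabulating with machine keeps the counts separate per machine
counts <- as.data.frame(table(interval_start = bins, machine = machine),
                        responseName = "numberobs")
counts
```

To label each bin by its end time instead, as in the desired output in the question, pass format(breaks[-1], ...) as the labels.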
Edwin
  • Thanks Edwin. It works great on my real dataset as well. It highlighted a terrible problem with some of my time conversions, and everything ran as it should after that was fixed. – ptenax Jan 27 '14 at 13:59

You may round a numeric representation of your times up to the nearest 30 minutes using ceiling:

df$time <- with(df, as.POSIXct(ceiling(as.numeric(ISOdatetime)/(30*60)) * (30*60),
                                origin = "1970-01-01"))

# counts for each combination of time and machine
with(df, data.frame(table(time, machine)))
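
If the rounded times come out shifted by several hours relative to the originals, one possible cause (an assumption, not something confirmed in this thread) is that the original column carries an explicit time zone while the rebuilt POSIXct is printed in the local one. A minimal sketch that keeps the zone fixed, using the six sample rows from the question and assuming they are stored as UTC:

```r
# Sample rows from the question; tz = "UTC" is an assumption about how the times are stored
df <- data.frame(
  machine = rep("M1", 6),
  ISOdatetime = as.POSIXct(c("2013-08-21 18:16:39", "2013-08-21 18:20:44",
                             "2013-08-21 18:21:42", "2013-08-21 18:46:09",
                             "2013-08-21 18:46:27", "2013-08-21 19:01:13"),
                           tz = "UTC")
)

half_hour <- 30 * 60  # seconds

# Round the epoch seconds up, then rebuild the date-time in the same zone
secs <- as.numeric(as.POSIXct(df$ISOdatetime))
df$time <- as.POSIXct(ceiling(secs / half_hour) * half_hour,
                      origin = "1970-01-01", tz = "UTC")

# Format with the zone fixed so the labels cannot shift when printed
df$time_label <- format(df$time, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Counts for each combination of rounded time and machine
counts <- as.data.frame(table(time = df$time_label, machine = df$machine),
                        responseName = "numberobs")
counts
```

Replace "UTC" with whatever zone the original column actually uses; the point of the sketch is only that the same zone is passed when rebuilding and when formatting.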
Henrik
  • Nice, works perfectly on my pretend dataset. But on my real dataset, I cannot get the new dates/times ($time2 below is df$time in your answer) to match original $isodatetime. Here is a str() excerpt. `$ isodatetime: POSIXlt, format: "2013-08-21 18:16:39" "2013-08-21 18:20:44" "2013-08-21 18:21:42" "2013-08-21 18:21:48" ... $ time2 : POSIXct, format: "2013-08-22 04:00:00" "2013-08-22 04:00:00" "2013-08-22 04:00:00" "2013-08-22 04:00:00" ...` – ptenax Jan 27 '14 at 11:56
  • I'm afraid I don't quite follow you. Can you please edit your question: `dput` a small sample of your data (see [**here**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610)), and post the code you have tried and explain what goes wrong. – Henrik Jan 27 '14 at 12:05
  • Thanks, I edited my original question. I hope it is clearer, apologies for the space. As a note, the data gets collected 'overnight' between 18:00:00 and 08:00:00 the next day, hence my column 'nightof' as a cross-check. Appreciate your trouble. – ptenax Jan 27 '14 at 14:22
  • Actually Henrik I am still keen to get your code to work on my dataset. While I accepted Edwin's answer above, both are valuable to me, and I have not nutted it out in the last couple of days despite putting in some time. If you have an idea, I would be grateful. – ptenax Jan 29 '14 at 07:13

Below is a function that is designed to generate such counts. Here is an example of its use:

crashDate <- as.Date(c("1908-09-17","1912-07-12","1913-08-06",
                        "1913-09-09","1913-10-17"))
df <- data.frame(date=crashDate)
byYears <- DAAGviz::eventCounts(data=df, dateCol="date",
                                 from=as.Date("1908-01-01"),
                                 by="1 year")

Output is:

> byYears
            Date n_
    1 1908-01-01  1
    2 1909-01-01  0
    3 1910-01-01  0
    4 1911-01-01  0
    5 1912-01-01  1
    6 1913-01-01  3

The argument categoryCol (character) optionally specifies the name of a column that holds categories by which counts are to be classified (one column per category). The argument takeOnly optionally holds a text string that, when evaluated in the environment of the data frame data, yields a vector of logicals that limits the rows that are to be counted. Examples of valid by arguments are: "1 day", or "1 week", or "4 weeks", or "1 month", or "1 quarter", or "1 year", or "10 years". See help(seq.Date).

If indeed there is not already such a function in base or in the recommended packages, I may submit the above, probably with minor modifications, for inclusion.

eventCounts <-
function (data, dateCol = "Date", from = NULL, to = NULL, by = "1 month", 
          categoryCol = NULL, takeOnly = NULL, prefix = "n_") 
{
    # Check every requested column, including dateCol when categoryCol is NULL
    checkCols <- c(dateCol, categoryCol) %in% names(data)
    if (!all(checkCols)) {
        txt <- paste("Name(s)",
                     paste(c(dateCol, categoryCol)[!checkCols], collapse = ", "),
                     "not found in", deparse(substitute(data)))
        stop(txt)
    }
    if (!is.null(takeOnly)) {
        subdat <- eval(parse(text = takeOnly), data)
        data <- subset(data, subdat)
    }
    date <- data[, dateCol]
    if (!inherits(date, "Date")) {
        date <- try(as.Date(date), silent = TRUE)
        if (inherits(date, "try-error")) 
            stop(paste("Column", dateCol, "must hold a date object"))
    }
    if (is.null(from)) 
        from <- min(date)
    if (is.null(to)) 
        to <- max(date)
    dateBreaks <- seq(from = from, to = to, by = by)
    dateBreaks <- c(dateBreaks, max(dateBreaks) + diff(dateBreaks[1:2]))
    countDF <- data.frame(Date = dateBreaks[-length(dateBreaks)])
    if (!is.null(categoryCol)) 
        categs <- names(table(data[, categoryCol]))
    else categs <- ""
    for (cat in categs) {
        if (!is.null(categoryCol)) 
            select <- data[, categoryCol] == cat
        else select <- rep(TRUE, nrow(data))
        cutDates <- cut(date[select], dateBreaks, right = FALSE)
        countNam <- paste0(prefix, gsub(" ", "", cat))
        countDF[, countNam] <- as.vector(table(cutDates))
    }
    countDF
}
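
The core of what the function does, and the effect of categoryCol, can be sketched with seq(), cut() and table() alone. This is an illustration rather than a replacement for the function: the dates are the ones from the example above, the category vector is hypothetical and added only to show the per-category columns, and the names breaks, bins and byCategory are assumptions.

```r
crashDate <- as.Date(c("1908-09-17", "1912-07-12", "1913-08-06",
                       "1913-09-09", "1913-10-17"))
# Hypothetical categories, one per date, for illustration only
category <- c("A", "B", "A", "B", "B")

# Yearly breaks; the last break closes the final interval
breaks <- seq(from = as.Date("1908-01-01"), to = as.Date("1914-01-01"), by = "1 year")

# right = FALSE gives intervals [start, end), labelled by their start date
bins <- cut(crashDate, breaks = breaks, right = FALSE,
            labels = format(breaks[-length(breaks)]))

# Total counts per interval, including empty intervals
table(bins)

# One column of counts per category
byCategory <- table(Date = bins, category = category)
byCategory
```

Other by strings accepted by seq() for dates, such as "1 month" or "1 quarter", can be substituted in the same way.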