1

I have a data set of about 28,000 entries. They consist of date/time stamps in the following format

02/21/2014 12:11:24 PM

I would like to create a couple graphs with the data to better visualize it. If someone could point me in the right direction as to how to make a graph that shows the amount of entries within a certain period that would be great. The plan is for the graph to identify how many people applied within the universe of the data set (about 3 weeks) every hour.

So, if there are 4 entries between 11:00 and 11:59pm on 2/21, I would like the graph to have a value of 4 on the y-axis.

If you feel there is a better platform to do this in that would be also appreciated.

  • 1
    Learn how to use `tapply` and see what you can figure out from that. – Eli Sadoff Jul 03 '16 at 02:24
  • First parse your dates with `as.POSIXct` or `strptime`, then use `cut.POSIXt` and check out `aggregate` or its dplyr/data.table equivalents (or even just `table`). Come back if you have a problem. – alistaire Jul 03 '16 at 02:42

2 Answers2

2

plot

## generate data
set.seed(1L);
N <- 28e3L;
dts <- sort(as.POSIXct('2014-02-01')+86400L*(sample(7L*3L,N,T,rep(c(1L,2L,rep(10L,5L)),3L))-1L)+rnorm(N,86400/2,86400/8));

## bucket into hours and table
dts.cut <- cut(dts,'hour');
dts.freq <- table(dts.cut);

## precompute plot parameters
xlim <- range(dts); xlim <- as.POSIXct(c(round(xlim[1L]-86400/2,'day'),round(xlim[2L]+86400/2,'day'))); ## must convert back from POSIXlt to POSIXct, otherwise plot() fails on xlim
xticks.day <- seq(xlim[1L],xlim[2L],'day');
xticks.week <- xticks.day[setdiff(which(weekdays(xticks.day)=='Saturday'),c(1L,length(xticks.day)))];
xticks <- rep(xticks.day,each=3L)+1:3*60*60*6;
ylim <- range(dts.freq); ylim <- c(0,(ylim[2L]+9L)%/%10L*10L);
yticks <- seq(0,ylim[2L],10L);
col <- 'red';

## helper function, from <http://stackoverflow.com/questions/29125019/get-margin-line-locations-mgp-in-user-coordinates>
line2user <- function(line,side) {
    lh <- par('cin')[2L]*par('cex')*par('lheight');
    x1 <- diff(grconvertX(0:1,'inches','user'));
    y1 <- diff(grconvertY(0:1,'inches','user'));
    switch(side,
        `1`=par('usr')[3L]-line*y1*lh,
        `2`=par('usr')[1L]-line*x1*lh,
        `3`=par('usr')[4L]+line*y1*lh,
        `4`=par('usr')[2L]+line*x1*lh,
        stop('side must be 1, 2, 3, or 4',call.=FALSE)
    );
}; ## end line2user()

## draw plot
par(mar=c(5,4,4,2)+0.1+c(2,0,0,0));
plot(NA,xlim=xlim,ylim=ylim,axes=F,xaxs='i',yaxs='i',ann=F);
abline(v=xticks,col='lightgrey');
segments(xticks.day,ylim[2L],y1=line2user(4,1L),col='darkgrey',lwd=2,xpd=NA);
segments(xticks.week,ylim[2L],y1=line2user(4,1L),col='black',lwd=2,xpd=NA);
abline(h=yticks,col='lightgrey');
abline(h=0);
axis(1L,xticks,format(xticks,'%H:00'),las=2L,cex.axis=0.7);
axis(2L,yticks,las=2L,cex.axis=0.7);
mtext('Time',1L,5,font=3L);
mtext('Frequency',2L,2.75,font=3L);
mtext(format(xticks.day[-length(xticks.day)],'%a %b %d'),1L,2.75,at=xticks.day[-length(xticks.day)]+12*60*60,cex=0.7,font=2L);
x <- as.POSIXct(names(dts.freq));
y <- dts.freq;
lines(x,y,col=col,xpd=NA);
points(x,y,pch=16L,cex=0.7,col=col,xpd=NA);
title(paste0('Events per hour, ',format(xlim[1L],'%Y-%m-%d'),' to ',format(xticks.day[length(xticks.day)-1L],'%Y-%m-%d')));
bgoldst
  • 34,190
  • 6
  • 38
  • 64
1

R is perfect for this. There are several packages which could be useful. I've shown some example data below and one of the most simple plotting routines. Additional plots which may be also interesting for you can be found in the ggplot2 package.

Use the lubridate package to make parsing of dates easier. You'll first need to import the data. Since example imported data wasn't provided, I've included some general tips at the end.

install.packages("lubridate")
library(lubridate)

Generate some example data:

Lubridate has a collection of similar functions which use the letters m, d, y, h, m, and s. You can arrange the letters in many different orders and there usually exists a function in the package which will parse your dates. For example, if you have a date only, like 2014/02/21, then you would use the ymd() function. For your data as described, you'll want mdy_hms(). You won't need seq() for imported data, but it's here to generate an example.

start_date <- mdy_hms("02/21/2014 12:11:24 PM")
end_date <- mdy_hms("02/22/2014 12:11:24 PM")

date.sequence <- seq(start_date,end_date, by = '1 hour')

Plot as a histogram and use the frequency option:

This will give you integers which are the counts of each bin on the y-axis. Without that, you would have density, which means the entire plot would be normalized such that the area under the entire curve equals 1. The second argument is called bin and you can replace it with a number like 20 or 100 as well. 28,000 probably wouldn't give a nice chart.

hist(date.sequence, length(date.sequence), freq = TRUE)

hist

Addendum on importing data:

This was not originally asked, but could also help for importing.

Import the data from a CSV file. The as.is argument will ensure that R doesn't use it's default methods to interpret the dates such that the lubridate package can be used later.

all.dates <- read.table( "filename.csv", as.is=TRUE )

Then, select the appropriate function from lubridate according to your format. For example:

all.dates.reformatted <- mdy_hms(all.dates)
Bobby
  • 1,585
  • 3
  • 19
  • 42
  • Thank you. The data set provided isn't the easiest to work with in my opinion, the source data appears to have extra characters here and there. I am very new at this as well, so pardon the questions. I'm going to try to import the table and see what happens. Thanks! – user2594004 Jul 03 '16 at 04:14
  • In this case, `lubridate` is possibly a really great option for you. It is tolerant of extra characters and will attempt to transform all dates into a usable standard. It will also tell you which entries could not be parsed. In any case, I would do your data cleaning in R rather than trying to do it first in an external program. – Bobby Jul 03 '16 at 04:16
  • how would I tell R/lubridate that the format I have imported the data in is in the format it currently is? I'm using R studio to import the data from a csv, and the date and time are in one cell. – user2594004 Jul 03 '16 at 04:19
  • Added some notes on importing above. – Bobby Jul 03 '16 at 04:27
  • When performing the import with as.is=TRUE, and then running View(all.dates), the date, time, and pm/am parameter were split apart into 3 separate columns. How would I tell lubridate to pull the rest of the parameters from the 2 other columns. – user2594004 Jul 03 '16 at 04:39
  • First combine the columns into a single one with 'paste0()'. Then use lubridate on this. – Bobby Jul 03 '16 at 04:43
  • df$newCol <- paste0(df$col1, "/", df$col2) – Bobby Jul 03 '16 at 04:45
  • Have you made some progress? – Bobby Jul 04 '16 at 20:02