
I am working through the following tutorial, which is copy/paste reproducible:

http://rstudio-pubs-static.s3.amazonaws.com/3369_998f8b2d788e4a0384ae565c4280aa47.html

On my setup, when I run this, I get an error

Error: `width` is deprecated. Do you want `geom_bar()`?

So I swapped geom_histogram for geom_bar, making my code:

ggplot(eventdata, aes(x = eventhour)) +
  geom_bar(breaks = seq(0, 24), width = 1, colour = "grey") +
  coord_polar(start = 0) +
  theme_minimal() +
  scale_fill_brewer() +
  ylab("Count") +
  ggtitle("Events by Time of day") +
  scale_x_continuous("", limits = c(0, 24), breaks = seq(0, 24), labels = seq(0, 24))

However, the graph I get is very different. On inspection, it seems to be "missing" the data from midnight to 1 AM (00:00:00 to 00:59:59, eventhour = 0).


I've tried running this on my own dataset (dput below) and I get a similarly odd result: it combines the "0" bin and the "23" bin, giving me one massive bin.

gplotAll = ggplot( eventdataAll, aes(x=eventdataAll$eventhour) ) +
  geom_histogram(breaks=seq(0,24), colour="purple") + coord_polar(start=0) +
  theme_minimal() + scale_fill_manual(values="blue") + ylab("Frequency") + 
  ggtitle("All Sources") + scale_x_continuous("", limits=c(0,24), 
  breaks=seq(0,24), labels=seq(0,24))


Note that this is a very small subset of my data, which contains millions of timestamps.

dput(eventdataAll[1:100,] )

structure(list(datetime = structure(c(1499433307, 1499428942, 
1499426105, 1499422506, 1499466293, 1499408104, 1499476505, 1499411705, 
1499400905, 1499466368, 1499454358, 1499453483, 1499405930, 1499484602, 
1499483709, 1499480109, 1499408108, 1499445444, 1499439817, 1499427520, 
1499418054, 1499416518, 1499414449, 1499410178, 1499409748, 1499409317, 
1499405867, 1499402279, 1499485071, 1499481544, 1499481527, 1499481459, 
1499481423, 1499481407, 1499477859, 1499475634, 1499474292, 1499474275, 
1499474253, 1499470435, 1499468435, 1499468413, 1499468398, 1499467032, 
1499464834, 1499463580, 1499463425, 1499461391, 1499460152, 1499460150, 
1499459806, 1499459745, 1499459366, 1499458914, 1499458463, 1499458012, 
1499457635, 1499457619, 1499455777, 1499454624, 1499454035, 1499454020, 
1499452801, 1499452695, 1499450434, 1499450414, 1499450404, 1499450403, 
1499450156, 1499446834, 1499446818, 1499446803, 1499445621, 1499444273, 
1499443234, 1499443218, 1499443201, 1499441873, 1499441806, 1499441700, 
1499441096, 1499441095, 1499440418, 1499440417, 1499436056, 1499434899, 
1499432434, 1499431018, 1499428801, 1499427491, 1499425201, 1499423442, 
1499421620, 1499421134, 1499427667, 1499421549, 1499472830, 1499451306, 
1499450792, 1499482802), class = c("POSIXct", "POSIXt"), tzone = ""), 
    eventhour = c(9L, 8L, 7L, 6L, 18L, 2L, 21L, 3L, 0L, 18L, 
    15L, 14L, 1L, 23L, 23L, 22L, 2L, 12L, 11L, 7L, 5L, 4L, 4L, 
    2L, 2L, 2L, 1L, 0L, 23L, 22L, 22L, 22L, 22L, 22L, 21L, 21L, 
    20L, 20L, 20L, 19L, 19L, 19L, 18L, 18L, 18L, 17L, 17L, 17L, 
    16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 15L, 15L, 
    15L, 15L, 14L, 14L, 14L, 14L, 14L, 14L, 13L, 13L, 13L, 13L, 
    12L, 12L, 12L, 12L, 12L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
    10L, 9L, 9L, 8L, 8L, 7L, 7L, 6L, 6L, 5L, 7L, 5L, 20L, 14L, 
    14L, 23L)), .Names = c("datetime", "eventhour"), row.names = c(NA, 
100L), class = "data.frame")

Any info on

(1) Why does the first, copy/pasted example not provide the data for "0" in the graph

and

(2) Why does my example combine the "23" and "0" bin

would be greatly appreciated.

EDIT: The following code fixed the issue but gave a warning. I do not think the warning is a problem, but I am curious whether anyone can interpret it. I believe the original issue is that R interprets a break as (val1, val2] and not [val1, val2] as I expected, so the 0 values were never kept. Changing my histogram breaks to seq(-1, 24) makes the bins inclusive of all values 0 through 23.

The fix:

gplotAll = ggplot( eventdataAll, aes(x=eventdataAll$eventhour) ) + 
  geom_histogram(breaks=seq(-1,24), colour="purple") + coord_polar(start=0) +
  theme_minimal() + scale_fill_manual(values="purple") + ylab("Frequency") + 
  ggtitle("All Sources") + scale_x_continuous("", limits=c(-1,23), 
  breaks=seq(-1,23), labels=seq(0,2) )

The warning:

Removed 1 rows containing missing values (geom_bar). 
Gregor Thomas
Jibril
  • The fix you posted gives me the error `Error: breaks and labels must have the same length` – Mako212 Jul 07 '17 at 16:24
  • The warning means that ggplot actually removed data (probably the data at 23 hours) from the plot, so not all the data are being included. See [this answer](https://stackoverflow.com/a/32506068/496488) for an explanation. – eipi10 Jul 08 '17 at 18:45

1 Answer


Your data are continuous in the sense that eventhour is numeric, but eventhour behaves like an ordered categorical variable because all of the eventhour values are integer hours of the day. Thus, you can do the plot with either geom_bar or geom_histogram. I'll use geom_bar (which is for categorical data) in the examples below and then show a version with geom_histogram (which is for continuous data) at the end.

The issue you're having with geom_bar is due to the fact that the bars are placed on integer hours but have a default width of close to 1 unit (ggplot2 documents the default as 90% of the data resolution). This means that, because of its finite width, the bar at zero (midnight) extends below zero by roughly half a unit. When you set the lower x-axis limit to zero, the bar at zero gets excluded, because scale_x_continuous excludes data that are outside the limits range. However, if you set the lower limit to -0.5, then the polar scale is no longer a 0 - 24 hour clock.
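The exclusion described above can be sketched in base R. This is only a simplified model, not ggplot2's internal code: it assumes a bar width of 0.9 (the documented geom_bar default; a width of 1 gives the same result), and the names `bar_width` and `fits` are made up for the illustration.

```r
# Simplified model: a bar is kept only if both of its edges lie inside the x limits.
hours <- 0:23
bar_width <- 0.9                      # assumed default width (90% of resolution)
xmin <- hours - bar_width / 2         # left edge of each bar
xmax <- hours + bar_width / 2         # right edge of each bar

# Hypothetical helper: TRUE for bars that fit entirely within [lower, upper]
fits <- function(lower, upper) xmin >= lower & xmax <= upper

dropped_tight <- hours[!fits(0, 24)]     # limits = c(0, 24)
dropped_wide  <- hours[!fits(-0.5, 24)]  # limits = c(-0.5, 24)

print(dropped_tight)
print(dropped_wide)
```

With limits of c(0, 24) only the midnight bar falls outside the range, which matches the single removed row reported in the warning; widening the lower limit to -0.5 keeps every bar.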

I'm going to show multiple examples below, but they all have some similar plot elements, so let's save those common elements in an object that we can reuse.

my_plot = list(coord_polar(start=0),
               geom_bar(colour="grey"),
               theme_minimal(),
               scale_fill_brewer(),
               ylab("Count"))

Now let's see what happens with geom_bar without coord_polar (we exclude the coord_polar statement in my_plot by doing my_plot[-1]). Note how the bar for midnight (eventhour = 0) is excluded when the width of the bar extends outside the x-range (first plot below) but appears when we extend the range to -0.5 (second plot below), which is the edge of the bar.

ggplot(eventdata, aes(x = eventhour)) + 
  my_plot[-1] +
  scale_x_continuous(limits = c(0,24), breaks=0:23) +
  ggtitle("x-limits: c(0,24)")

ggplot(eventdata, aes(x = eventhour)) + 
  my_plot[-1] +
  scale_x_continuous(limits = c(-0.5,24), breaks=0:23) + 
  ggtitle("x-limits: c(-0.5,24)")


Now let's add coord_polar to the second plot above. The code is just below and the plot is on the left below. Note that 0 is now rotated clockwise and there's an extra half-hour wedge before 0.

To fix these issues, we'll change the coord_polar statement to rotate the plot counter-clockwise by 7.5 degrees (1/48 of the circle) and remove half an hour from the other end of the plot by changing the limit to 23.5 instead of 24. This doesn't remove any data because the highest hour value is 23.

We also remove the minor gridlines to get rid of an unwanted gridline that would otherwise appear at 23.5 hours. There are actually minor gridlines at all of the half-hours, but this one gets plotted twice (because it represents both -0.5 and 23.5 hours) and is therefore more prominent than the others. We don't really need minor gridlines here, so we just get rid of them completely.

The code for this plot is the second ggplot block below and the plot is on the right.

ggplot(eventdata, aes(x = eventhour)) + 
  my_plot +
  scale_x_continuous(limits = c(-0.5,24), breaks=0:23) +
  ggtitle("x-limits: c(-0.5,24)")

ggplot(eventdata, aes(x = eventhour)) + 
  my_plot[-1] +
  scale_x_continuous(limits = c(-0.5,23.5), breaks=0:23) +
  ggtitle("x-limits: c(-0.5,23.5)")  +
  coord_polar(start=-pi/24) +  # start is in radians: pi/24 rad = 7.5 degrees = 1/48 of the circle
  theme(panel.grid.minor=element_blank())
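As a quick check on the rotation amount: coord_polar documents `start` as an offset from 12 o'clock in radians, so the 7.5 degree (1/48 of the circle) rotation has to be converted. A minimal sketch of the arithmetic, with variable names chosen only for this illustration:

```r
# The x-axis spans -0.5 to 23.5, i.e. 24 one-hour units around the circle.
n_hours <- 24
half_bar_fraction <- 0.5 / n_hours          # half an hour as a fraction of the circle (1/48)

start_rad <- -2 * pi * half_bar_fraction    # counter-clockwise offset in radians (-pi/24)
start_deg <- start_rad * 180 / pi           # the same offset expressed in degrees

print(c(radians = start_rad, degrees = start_deg))
```

The negative sign rotates the plot counter-clockwise, so that the centre of the midnight bar, rather than its left edge, sits at 12 o'clock.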


So, the final plot code is:

ggplot(eventdata, aes(x=eventhour)) + 
  geom_bar(colour="grey") + 
  theme_minimal() +
  scale_fill_brewer() +
  ylab("Count") +
  coord_polar(start=-pi/24) +  # start is in radians: pi/24 rad = 7.5 degrees
  scale_x_continuous(limits=c(-0.5,23.5), breaks=0:23) +
  theme(panel.grid.minor=element_blank())

The equivalent plot with geom_histogram is below. binwidth=1 means that each bar will be 1 hour wide. center=0 ensures that each bar will be centered on a whole number (we could have chosen any whole number here instead of 0). In some cases it also matters whether the bins are closed on the left or the right (it would matter here if we set, say, center=0.5). You can set that with the closed argument; closed="right" or closed="left".

ggplot(eventdata, aes(x=eventhour)) + 
  geom_histogram(colour="grey", center=0, binwidth=1) + 
  theme_minimal() +
  scale_fill_brewer() +
  ylab("Count") +
  coord_polar(start=-pi/24) +  # start is in radians: pi/24 rad = 7.5 degrees
  scale_x_continuous(limits=c(-0.5,23.5), breaks=0:23) +
  theme(panel.grid.minor=element_blank())
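To see why the closed side of a bin matters for integer hours that fall exactly on the breaks, here is a small base R illustration. It uses cut() purely as an analogy for left-closed versus right-closed intervals and is not a description of how ggplot2 bins data internally.

```r
# Integer hours sitting exactly on the breaks 0, 1, ..., 24
hours  <- c(0L, 1L, 23L)
breaks <- seq(0, 24)

# Right-closed intervals (a, b]: the value 0 is not inside (0, 1]
right_closed <- cut(hours, breaks = breaks)

# Left-closed intervals [a, b): each hour h lands in [h, h + 1)
left_closed <- cut(hours, breaks = breaks, right = FALSE)

print(data.frame(hours, right_closed, left_closed))
```

With right-closed intervals the value 0 has no bin and every other hour is assigned to the interval ending at that hour, whereas left-closed intervals give each hour its own bin starting at that hour. Centering the bins on whole numbers, as center=0 does above, avoids the question because no data value sits on a bin edge.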
eipi10