ggplot2 can't draw a correct plot with only two or three data points

Question

I'm using R to generate some plots of some metrics and getting nice results like this for data that has > 3 data points:

However, I'm noticing that for data with only a few values - I get very poor results.

If I draw a plot with only two data points, I get a blank plot.
foo_two_points.dat

cluster,account,current_database,action,operation,count,day
cluster19,col0063,col0063,foo_two,two_bar,10,2016-10-04 00:00:00-07:00
cluster61,dwm4944,dwm4944,foo_two,two_bar,2,2016-12-14 00:00:00-08:00

If I draw one data point, it works.
foo_one_point.dat

cluster,account,current_database,action,operation,count,day
cluster1,foo0424,foo0424,fooone,,2,2016-11-01 00:00:00-07:00

With three, it almost works, but the plot isn't accurate.
foo_three_points.dat

cluster,account,current_database,action,operation,count,day
cluster23,col2225,col2225,foo_three,bar,9,2016-12-22 00:00:00-08:00
cluster23,col2225,col2225,foo_three,bar,1,2016-12-29 00:00:00-08:00
cluster12,red1782,red1782,foo_three,bar,2,2016-10-25 00:00:00-07:00

4, 5, etc. all seem fine

But two or three points - nope.

Here is my plot.r file:

library(ggplot2)
library(scales)

args<-commandArgs(TRUE)

filename<-args[1]
n = nchar(filename) - 4
thetitle = substring(filename, 1, n)
print(thetitle)
png_filename <- stringi::stri_flatten(stringi::stri_join(c(thetitle,'.png')))

wide<-as.numeric(args[2])
high<-as.numeric(args[3])
legend_left<-as.numeric(args[4])

pos <- if(legend_left == 1) c(1,0)  else c(0,1) 
place <- if(legend_left == 1) 'left'  else 'right'

print(wide)
print(high)

print(filename)
print(png_filename)

dat = read.csv(filename)

dat$account = as.character(dat$account)
dat$action=as.character(dat$action)
dat$operation = as.character(dat$operation)
dat$count = as.integer(dat$count)
dat$day = as.Date(dat$day)
dat[is.na(dat)]<-"N/A"

png(png_filename,width=wide,height=high)

p <- ggplot(dat, aes(x=day, y=count, fill=account, labels=TRUE)) 
p <- p + geom_histogram(stat="identity") 
p <- p + scale_x_date(labels=date_format("%b-%Y"), limits=as.Date(c('2016-10-01','2017-01-01')))
p <- p + theme(legend.position="bottom")
p <- p + guides(fill=guide_legend(nrow=5, byrow=TRUE))
p <- p + theme(text = element_text(size=15)) 
p<-p+labs(title=thetitle)

print(p)

dev.off()

Here's the command I use to run it:

Rscript plot.r foo_five_points.dat 1600 800 0

What am I doing wrong?

Maybe you need `geom_bar` instead of `geom_histogram()`? one or two points with a histogram doesn't seem like a good idea. — Psidom, Feb 10 '17 at 19:13
Well, for other reports with many many more rows of data - I get a stacked histogram which is what I want. This is for an automated process, so I don't want to have to finesse the plot for particular small amounts of data — slashdottir, Feb 10 '17 at 19:14
@Psidom is correct - if you have a count variable already, you don't need a histogram. See: http://stackoverflow.com/questions/31408506/make-a-histogram-whos-frequency-is-a-value-in-the-row/31408618#31408618 — C8H10N4O2, Feb 10 '17 at 19:15
These are a few examples of erroneous plots. I have hundreds of others all using the same script. And as I mentioned, if the number of values is < 2 or > 4, it looks ok. I don't want to have to finesse the graphs, I just want to run one script. I think this is a bug in R or ggplot2, not user error — slashdottir, Feb 10 '17 at 19:17
@slashdottir you might as well provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), since you'd need to do that for a bug report anyway — C8H10N4O2, Feb 10 '17 at 19:20
Already did? All the code to produce the graph is already in the question — slashdottir, Feb 10 '17 at 19:21
@slashdottir you did not provide a "minimal dataset and minimal runnable code" - please read my link. Also your example is very confusing because... — C8H10N4O2, Feb 10 '17 at 19:22
...you have a column called `count` in your dataset and then `geom_histogram` aggregates these records by count (not sum) into another variable called `count`. Is that really what you want? I'm not a mind reader. — C8H10N4O2, Feb 10 '17 at 19:23
The datasets are provided - directly underneath each plot. The plot.r script is there and so is the command line call to run it. What do you feel is missing? — slashdottir, Feb 10 '17 at 19:25
Your problem is the `limits` and the bar is out of the range of the `limits` you set up. If you expand the limits to a larger range, the bar appears. — Psidom, Feb 10 '17 at 19:44
@Psidom Can you give specifics. I cannot see any data that is outside the limit? — slashdottir, Feb 10 '17 at 19:47
Your data is not outside the range, but when you use histogram, it will cut your variable and redefine the bucket for your data. I do not know how to fix it, but `limits=as.Date(c('2016-09-01','2017-02-10'))` will show the bars. — Psidom, Feb 10 '17 at 19:49
Hmm. I tried your suggestion and get two really fat bars that don't reflect the data accurately. — slashdottir, Feb 10 '17 at 19:52
It does seem like a bug, considering the difference between the behavior of one point and two points plot. — Psidom, Feb 10 '17 at 19:55

Mike Wise · Answer 1 · 2017-02-12T17:32:14.560


I don't know whether this is a bug. I think it is actually by design: the bars are getting clipped because they spill over the limits.

I also think this is more of a geom_bar than a geom_histogram, as this doesn't seem to be distribution data. That is irrelevant to the issue, though, since both behave the same.
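The clipping described above can be checked by hand. The sketch below is base R only and rests on an assumption that is not stated anywhere in this thread: that the default bar width is 90% of the smallest gap between distinct x values (0.9 for a single value), and that a bar is dropped when either edge falls outside the scale limits. `bar_extents` is a hypothetical helper name, not a ggplot2 function.

```r
# Sketch under an assumption: default bar width = 0.9 * smallest gap between
# distinct x values (or 0.9 when there is only one value).
bar_extents <- function(days, limits) {
  x <- as.numeric(days)
  ux <- sort(unique(x))
  gap <- if (length(ux) > 1) min(diff(ux)) else 1
  width <- 0.9 * gap
  data.frame(
    day = days,
    width_days = width,
    # TRUE when both edges of the bar lie within the limits
    inside = (x - width / 2) >= as.numeric(limits[1]) &
             (x + width / 2) <= as.numeric(limits[2])
  )
}

# Limits and dates taken from the question
limits <- as.Date(c("2016-10-01", "2017-01-01"))

one   <- bar_extents(as.Date("2016-11-01"), limits)
two   <- bar_extents(as.Date(c("2016-10-04", "2016-12-14")), limits)
three <- bar_extents(as.Date(c("2016-12-22", "2016-12-29", "2016-10-25")), limits)

print(one)
print(two)
print(three)
```

Under that assumption, the two points sit 71 days apart, so each bar would be about 64 days wide and both would cross a limit. In the three-point file the smallest gap is 7 days, and only the bar centred on 2016-12-29 would cross 2017-01-01. That would match a blank plot for two points and an incomplete one for three.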

One solution is to set the width parameter explicitly in geom_histogram instead of letting it be calculated:

p <- ggplot(dat, aes(x=day, y=count, fill=account, labels=TRUE)) 
p <- p + geom_histogram(stat="identity",width=1) 
p <- p + scale_x_date(labels=date_format("%b-%Y"), limits=as.Date(c('2016-10-01','2017-01-01')))
p <- p + theme(legend.position="bottom")
p <- p + guides(fill=guide_legend(nrow=5, byrow=TRUE))
p <- p + theme(text = element_text(size=15)) 
p<-p+labs(title=thetitle)

Then your two point example that is blank above gives you this - which seems right:

I can't be sure that setting the width explicitly will work when you have a lot of data, though, since the bars need to keep getting narrower. I suppose you could set it conditionally.
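One way to set the width conditionally might look like the sketch below. It is untested against the larger reports and makes two assumptions: that the default width is 90% of the smallest gap between distinct dates, and that every date lies strictly inside the limits. `fit_bar_width` is a hypothetical helper, not part of ggplot2.

```r
# Hypothetical helper: use the default-style width (90% of the smallest gap),
# but cap it so the outermost bars still fit inside the scale limits.
fit_bar_width <- function(days, limits, shrink = 0.9) {
  x <- as.numeric(days)
  ux <- sort(unique(x))
  gap <- if (length(ux) > 1) min(diff(ux)) else 1
  # Distance from the outermost bar centres to the nearest limit, doubled
  # because a bar extends half its width to each side of its centre.
  room <- 2 * min(min(x) - as.numeric(limits[1]),
                  as.numeric(limits[2]) - max(x))
  min(shrink * gap, room)
}

# Limits and dates taken from the question
limits <- as.Date(c("2016-10-01", "2017-01-01"))

w_one   <- fit_bar_width(as.Date("2016-11-01"), limits)
w_two   <- fit_bar_width(as.Date(c("2016-10-04", "2016-12-14")), limits)
w_three <- fit_bar_width(as.Date(c("2016-12-22", "2016-12-29", "2016-10-25")), limits)

print(c(one = w_one, two = w_two, three = w_three))
```

With the limits from the question this gives 6 days for the two-point and three-point files and 0.9 for the single point. The result could then be passed as `width = fit_bar_width(dat$day, limits)` in place of `width=1`. Dense data keeps the narrower gap-based width, and the cap only applies when a bar would otherwise cross a limit.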

edited Feb 12 '17 at 17:32

answered Feb 12 '17 at 05:20


Thanks, looks like you got something that works. I will try it. Although I can't see how the bars are getting clipped by the limits as the data is inside it. – slashdottir Feb 13 '17 at 00:52
Probably clipping using the edges of the bars as the criterion. I suppose that could be considered a bug. Could file it on github and see what Hadley and co say about it. – Mike Wise Feb 14 '17 at 16:17
stopped caring - just lost my job.. :/ – slashdottir Feb 21 '17 at 19:17
Sorry to hear that. Hope you get something better. – Mike Wise Mar 14 '17 at 21:28
