
I'm using R to generate some plots of some metrics and getting nice results like this for data that has > 3 data points:

[image]

However, I'm noticing that for data with only a few values, I get very poor results.

If I draw a plot with only two data points, I get a blank plot.

[image]

foo_two_points.dat:

cluster,account,current_database,action,operation,count,day
cluster19,col0063,col0063,foo_two,two_bar,10,2016-10-04 00:00:00-07:00
cluster61,dwm4944,dwm4944,foo_two,two_bar,2,2016-12-14 00:00:00-08:00

If I draw one data point, it works.

[image]

foo_one_point.dat:

cluster,account,current_database,action,operation,count,day
cluster1,foo0424,foo0424,fooone,,2,2016-11-01 00:00:00-07:00

With three, it almost works, but the plot isn't accurate.

[image]

foo_three_points.dat:

cluster,account,current_database,action,operation,count,day
cluster23,col2225,col2225,foo_three,bar,9,2016-12-22 00:00:00-08:00
cluster23,col2225,col2225,foo_three,bar,1,2016-12-29 00:00:00-08:00
cluster12,red1782,red1782,foo_three,bar,2,2016-10-25 00:00:00-07:00

Four, five, etc. all seem fine.

[image]

But two or three points - nope.

Here is my plot.r file:

library(ggplot2)
library(scales)

args<-commandArgs(TRUE)

filename<-args[1]
n = nchar(filename) - 4
thetitle = substring(filename, 1, n)
print(thetitle)
png_filename <- stringi::stri_flatten(stringi::stri_join(c(thetitle,'.png')))

wide<-as.numeric(args[2])
high<-as.numeric(args[3])
legend_left<-as.numeric(args[4])

pos <- if(legend_left == 1) c(1,0)  else c(0,1) 
place <- if(legend_left == 1) 'left'  else 'right'

print(wide)
print(high)

print(filename)
print(png_filename)

dat = read.csv(filename)

dat$account = as.character(dat$account)
dat$action=as.character(dat$action)
dat$operation = as.character(dat$operation)
dat$count = as.integer(dat$count)
dat$day = as.Date(dat$day)
dat[is.na(dat)]<-"N/A"

png(png_filename,width=wide,height=high)

p <- ggplot(dat, aes(x=day, y=count, fill=account, labels=TRUE)) 
p <- p + geom_histogram(stat="identity") 
p <- p + scale_x_date(labels=date_format("%b-%Y"), limits=as.Date(c('2016-10-01','2017-01-01')))
p <- p + theme(legend.position="bottom")
p <- p + guides(fill=guide_legend(nrow=5, byrow=TRUE))
p <- p + theme(text = element_text(size=15)) 
p<-p+labs(title=thetitle)

print(p)

dev.off()

Here's the command I use to run it:

RScript plot.r foo_five_points.dat 1600 800 0

What am I doing wrong?

slashdottir
  • What do you mean, there are no bars? – slashdottir Feb 10 '17 at 19:12
  • Maybe you need `geom_bar` instead of `geom_histogram()`? One or two points with a histogram doesn't seem like a good idea. – Psidom Feb 10 '17 at 19:13
  • Well, for other reports with many many more rows of data - I get a stacked histogram which is what I want. This is for an automated process, so I don't want to have to finesse the plot for particular small amounts of data – slashdottir Feb 10 '17 at 19:14
  • @Psidom is correct - if you have a count variable already, you don't need a histogram. See: http://stackoverflow.com/questions/31408506/make-a-histogram-whos-frequency-is-a-value-in-the-row/31408618#31408618 – C8H10N4O2 Feb 10 '17 at 19:15
  • These are a few examples of erroneous plots. I have hundreds of others all using the same script. And as I mentioned, if the number of values is < 2 or > 4, it looks ok. I don't want to have to finesse the graphs, I just want to run one script. I think this is a bug in R or ggplot2, not user error – slashdottir Feb 10 '17 at 19:17
  • @slashdottir you might as well provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), since you'd need to do that for a bug report anyway – C8H10N4O2 Feb 10 '17 at 19:20
  • Already did? All the code to produce the graph is already in the question – slashdottir Feb 10 '17 at 19:21
  • @slashdottir you did not provide a "minimal dataset and minimal runnable code" - please read my link. Also your example is very confusing because... – C8H10N4O2 Feb 10 '17 at 19:22
  • ...you have a column called `count` in your dataset and then `geom_histogram` aggregates these records by count (not sum) into another variable called `count`. Is that really what you want? I'm not a mind reader. – C8H10N4O2 Feb 10 '17 at 19:23
  • The datasets are provided - directly underneath each plot. The plot.r script is there and so is the command line call to run it. What do you feel is missing? – slashdottir Feb 10 '17 at 19:25
  • Where do you feel a 'sum' is required? Not following – slashdottir Feb 10 '17 at 19:27
  • Your problem is the `limits` and the bar is out of the range of the `limits` you set up. If you expand the limits to a larger range, the bar appears. – Psidom Feb 10 '17 at 19:44
  • @Psidom Can you give specifics. I cannot see any data that is outside the limit? – slashdottir Feb 10 '17 at 19:47
  • Your data is not outside the range, but when you use histogram, it will cut your variable and redefine the bucket for your data. I do not know how to fix it, but `limits=as.Date(c('2016-09-01','2017-02-10'))` will show the bars. – Psidom Feb 10 '17 at 19:49
  • Hmm. I tried your suggestion and get two really fat bars that don't reflect the data accurately. – slashdottir Feb 10 '17 at 19:52
  • It does seem like a bug, considering the difference between the behavior of one point and two points plot. – Psidom Feb 10 '17 at 19:55
  • @Psidom Thanks, at least I know that it's not my script – slashdottir Feb 10 '17 at 20:03

1 Answer

I don't know if this is a bug. I think it is actually by design: the bars are getting clipped because they spill over the limits.

I also think this is more of a `geom_bar` than a `geom_histogram`, as this doesn't seem to be distribution data, but that is irrelevant to the issue, since both behave the same.
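To check the clipping idea by hand, here is a rough sketch in base R. It rests on an assumption that is not stated in the question: that when no width is given, the bar width defaults to 90% of the smallest gap between distinct x values (with a gap of 1 assumed when there is only one distinct value), and that a bar whose edge falls outside the scale limits is dropped. The variable names are only illustrative.

```r
# Sketch only: estimate the default bar width and flag bars that cross the limits.
# Assumption: default width = 90% of the smallest gap between distinct x values.
days   <- as.Date(c("2016-10-04", "2016-12-14"))   # the two-point example
limits <- as.Date(c("2016-10-01", "2017-01-01"))   # limits used in plot.r

gaps <- diff(sort(unique(as.numeric(days))))
# With a single distinct value there is no gap; a gap of 1 day is assumed.
min_gap   <- if (length(gaps) == 0) 1 else min(gaps)
bar_width <- 0.9 * min_gap                          # width in days

# Each bar spans half the width on either side of its date.
left_edge  <- as.numeric(days) - bar_width / 2
right_edge <- as.numeric(days) + bar_width / 2
out_of_bounds <- left_edge < as.numeric(limits[1]) |
                 right_edge > as.numeric(limits[2])

print(bar_width)
print(data.frame(day = days, out_of_bounds = out_of_bounds))
```

Under that assumption the two points are 71 days apart, so each bar would be roughly 64 days wide and both would cross the limits, which would explain the blank plot. With a single point the assumed width is under a day, so the bar fits. In the three-point file the smallest gap is 7 days, and the bar on 2016-12-29 would just cross the 2017-01-01 limit.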

One solution is to set the width parameter explicitly in geom_histogram instead of letting it be calculated:

p <- ggplot(dat, aes(x=day, y=count, fill=account, labels=TRUE)) 
p <- p + geom_histogram(stat="identity",width=1) 
p <- p + scale_x_date(labels=date_format("%b-%Y"), limits=as.Date(c('2016-10-01','2017-01-01')))
p <- p + theme(legend.position="bottom")
p <- p + guides(fill=guide_legend(nrow=5, byrow=TRUE))
p <- p + theme(text = element_text(size=15)) 
p<-p+labs(title=thetitle)

Then your two-point example that is blank above gives you this, which seems right:

[image]

I can't be sure that setting the width explicitly will work when you have a lot of data, though, and the bars need to keep getting smaller. I suppose you could set it conditionally.
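One way the conditional width might look is sketched below. `pick_bar_width` is a hypothetical helper, not part of ggplot2, and the `cap` of 1 day is just a guess at a sensible maximum. It reuses the same assumed rule as the default (90% of the smallest gap between distinct dates) but never lets the width grow beyond the cap, so sparse data cannot produce bars wide enough to cross the limits.

```r
# Hypothetical helper: choose a bar width in days, never wider than `cap`.
# Assumption: 90% of the smallest gap between distinct dates is a sensible
# width for dense data, and `cap` bounds it for sparse data.
pick_bar_width <- function(days, cap = 1) {
  gaps <- diff(sort(unique(as.numeric(days))))
  # A single distinct date has no gap; fall back to a gap of 1 day.
  min_gap <- if (length(gaps) == 0) 1 else min(gaps)
  min(cap, 0.9 * min_gap)
}

sparse <- as.Date(c("2016-10-04", "2016-12-14"))  # two points, 71 days apart
dense  <- as.Date("2016-10-01") + 0:91            # one point per day

print(pick_bar_width(sparse))
print(pick_bar_width(dense))
```

The result could then be passed as the `width` argument in place of the hard-coded 1, for example `geom_histogram(stat="identity", width=pick_bar_width(dat$day))`. I have not tried this against the full set of reports.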

Mike Wise