
I want a nice histogram for some discrete data in which the bar heights are proportions that sum to 1. I have tried a couple of ways to do this, but none were entirely satisfactory.

Generate some data:

#data
set.seed(-999)
d.test = data.frame(score = round(rnorm(100,1)))
mean.score = mean(d.test[,1])
#name the table dimension explicitly so the score column is called d.test
d1 = as.data.frame(prop.table(table(d.test = d.test$score)))

The first attempt places the bars correctly, centered on their values, but puts the `geom_vline()` in the wrong place. This is because the x-axis is discrete (a factor), so the intercept is read as a position among the levels (1, 2, 3, ...) rather than as a score. The mean value is .89.
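To see why the line lands in the wrong place, it may help to compare the positions a discrete axis uses with the values shown as labels. This is a minimal base-R sketch on the same data; naming the table dimension explicitly is an assumption made here so that the first column is called `d.test` regardless of R version.

```r
set.seed(-999)
d.test <- data.frame(score = round(rnorm(100, 1)))
mean.score <- mean(d.test$score)

# Name the table dimension explicitly so the column is called d.test.
d1 <- as.data.frame(prop.table(table(d.test = d.test$score)))

# The score column is a factor: a discrete axis places level i at x = i.
pos <- as.integer(d1$d.test)                 # 1, 2, ..., number of levels
val <- as.numeric(as.character(d1$d.test))   # the scores shown as labels

print(data.frame(label = d1$d.test, axis.position = pos, value = val))

# geom_vline(xintercept = mean.score) is read against axis.position,
# not against value, which is why the line appears shifted.
print(mean.score)
```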

ggplot(data=d1, aes(x=d.test, y=Freq)) +
  geom_bar(stat="identity", width=.5) +
  geom_vline(xintercept=mean.score, color="blue", linetype="dashed")


The second attempt gives the correct `geom_vline()` placement (because the x-axis is continuous), but the bars are in the wrong place, and the `width` parameter does not appear to be modifiable when the x-axis is continuous (see here). I also tried the `size` parameter, which has no effect either. Ditto for `hjust`.
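On a continuous axis the bar width is governed by the bins rather than by `width`. One possible sketch, assuming a ggplot2 version recent enough to provide `after_stat()` and the `center` argument of `stat_bin()` (neither is used in the original attempts), sets `binwidth` and aligns the bin centres with the integer scores:

```r
library(ggplot2)

set.seed(-999)
d.test <- data.frame(score = round(rnorm(100, 1)))
mean.score <- mean(d.test$score)

# binwidth sets the bar width on a continuous axis; center = 0 puts bin
# centres at multiples of binwidth, so every integer score is a bin centre.
p <- ggplot(d.test, aes(x = score)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth = 0.5, center = 0) +
  geom_vline(xintercept = mean.score, color = "blue", linetype = "dashed")

# Inspect the computed bars: x is the bin centre, y the proportion.
bars <- layer_data(p, 1)
print(bars[bars$count > 0, c("x", "xmin", "xmax", "y")])
```

The bins that fall between integers are empty, so the visible bars are .5 wide with gaps between them.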

ggplot(d.test, aes(x=score)) +
  geom_histogram(aes(y=..count../sum(..count..)), width=.5) +
  geom_vline(xintercept=mean.score, color="blue", linetype="dashed")


Any ideas? My one (bad) idea is to rescale the mean so that it lines up with the factor levels and use the first solution. This won't work well when some of the factor levels are missing, e.g. 1, 2, 4 with no level for 3 because no data point had that value. If the mean is 3.5, rescaling it is awkward, because the x-axis is no longer an interval scale.

Another idea is this:

ggplot(d.test, aes(x=score)) +
  stat_bin(binwidth=.5, aes(y= ..density../sum(..density..)), hjust=-.5) +
  scale_x_continuous(breaks = -2:5) + #add ticks back
  geom_vline(xintercept=mean.score, color="blue", linetype="dashed")

But this requires adjusting the breaks, and the bars are still in the wrong positions (not centered). Unfortunately, hjust does not appear to work.


How do I get everything I want?

  • density sums to 1
  • bars centered above values
  • `geom_vline()` at the correct value
  • width=.5

With base graphics, one could perhaps solve this problem by plotting twice on the x-axis. Is there some similar way here?

CoderGuy123

1 Answer


It sounds like you just want to make sure that your x-axis values are numeric rather than a factor:

ggplot(data=d1, aes(x=as.numeric(as.character(d.test)), y=Freq)) +
  geom_bar(stat="identity", width=.5) +
  geom_vline(xintercept=mean.score, color="blue", linetype="dashed") + 
  scale_x_continuous(breaks=-2:3)

which gives bars centered on their scores, with the dashed line at the mean.
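For reference, here is a self-contained sketch of the same approach that can be run from a clean session. The column name `score.num` and the data-derived breaks are assumptions added here for illustration; the table dimension is named explicitly so that the column is called `d.test`.

```r
library(ggplot2)

set.seed(-999)
d.test <- data.frame(score = round(rnorm(100, 1)))
mean.score <- mean(d.test$score)

d1 <- as.data.frame(prop.table(table(d.test = d.test$score)))

# Convert the factor labels back to numbers. Go through character first:
# as.numeric() applied directly to a factor returns the level codes.
d1$score.num <- as.numeric(as.character(d1$d.test))

p <- ggplot(d1, aes(x = score.num, y = Freq)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_vline(xintercept = mean.score, color = "blue", linetype = "dashed") +
  # Derive the tick positions from the data instead of hard-coding them.
  scale_x_continuous(breaks = seq(min(d1$score.num), max(d1$score.num)))
p
```

Because the axis is now continuous, a score with no observations simply shows up as a gap, and the mean line stays on the same scale as the bars.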

MrFlick
  • Silly that I did not inspect the `data.frame` to see which type `prop.table()` has given it. It outputs `character`, and `data.frame()` thus converts that to `factor` since `stringsAsFactors=F` wasn't set. – CoderGuy123 May 04 '15 at 22:06
  • @Deleet The inverse option of this (of sorts) would be to plot the vertical line at the weighted mean of the factor levels: `with(d1,weighted.mean(as.integer(d.test),w = Freq))`. – joran May 04 '15 at 22:10
  • @Joran that might work, but it would give strange results if some levels were not present (e.g. due to sampling error in small datasets). – CoderGuy123 May 04 '15 at 22:12
  • @Deleet Possibly, but I think it should be fine as long as the frequencies sum to 1. – joran May 04 '15 at 22:15
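The missing-level concern from the comments can be checked on a small made-up example. This is only a sketch: the data vector `x` and the linear interpolation via `approx()` are assumptions chosen for illustration, not part of the question or answer.

```r
# Hypothetical scores with no observation at 3.
x <- c(1, 2, 4, 4)
d <- as.data.frame(prop.table(table(score = x)))

pos <- as.integer(d$score)                # level positions: 1, 2, 3
val <- as.numeric(as.character(d$score))  # actual scores:   1, 2, 4

mean.val <- weighted.mean(val, w = d$Freq)  # mean on the value scale
mean.pos <- weighted.mean(pos, w = d$Freq)  # weighted mean of the positions

# Where the true mean would sit on the discrete axis if interpolated
# linearly between the two neighbouring levels.
interp.pos <- approx(x = val, y = pos, xout = mean.val)$y

print(c(mean.val = mean.val, mean.pos = mean.pos, interp.pos = interp.pos))
# mean.val = 2.75, mean.pos = 2.25, interp.pos = 2.375
```

With a gap in the levels the two positions disagree, whereas for consecutive integer scores the positions are just the values shifted by a constant and the weighted mean of the positions marks the same spot as the true mean.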