I'm having quite the time understanding geom_bar() and position="dodge". I was trying to make some bar graphs illustrating two groups. Originally the data was from two separate data frames. Per this question, I put my data in long format. My example:

test <- data.frame(names=rep(c("A","B","C"), 5), values=1:15)
test2 <- data.frame(names=c("A","B","C"), values=5:7)

df <- data.frame(names=c(paste(test$names), paste(test2$names)),
                 num=c(rep(1, nrow(test)), rep(2, nrow(test2))),
                 values=c(test$values, test2$values))

I use that example as it's similar to the spend vs. budget example. Spending has many rows per names factor level whereas the budget only has one (one budget amount per category).

For a stacked bar plot, this works great:

ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity")

stacked plot

In particular, note the y value maxes. They are the sums of the data from test, with the values of test2 shown in blue on top.

Based on other questions I've read, I simply need to add position="dodge" to make it a side-by-side plot vs. a stacked one:

ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) + 
geom_bar(stat="identity", position="dodge")

dodged

It looks great, but note the new max y values. It seems like it's just taking the max y value from each names factor level from test for the y value. It's no longer summing them.

Per some other questions (like this one and this one), I also tried adding the group= option without success (it produces the same dodged plot as above):

ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(num))) +
geom_bar(stat="identity", position="dodge")

I don't understand why the stacked works great and the dodged doesn't just put them side by side instead of on top.


ETA: I found a recent question about this on the ggplot google group with the suggestion to add alpha=0.5 to see what's going on. It isn't that ggplot is taking the max value from each grouping; it's actually over-plotting bars on top of one another for each value.
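To check this numerically, here is a minimal base R sketch using the example data from above. It compares the per-group sum (the height stacking would produce) with the per-group maximum (the tallest of the over-plotted bars, which is the height that stays visible when dodging). The names sums, maxes, and heights are only illustrative.

```r
# Rebuild the example data from the question
test  <- data.frame(names = rep(c("A", "B", "C"), 5), values = 1:15)
test2 <- data.frame(names = c("A", "B", "C"), values = 5:7)
df <- data.frame(
  names  = c(as.character(test$names), as.character(test2$names)),
  num    = c(rep(1, nrow(test)), rep(2, nrow(test2))),
  values = c(test$values, test2$values)
)

# Height of each stacked segment: the sum of values per names/num combination
sums  <- aggregate(values ~ names + num, data = df, FUN = sum)

# Apparent height when dodging: the largest single value per combination,
# because the individual bars are drawn on top of one another
maxes <- aggregate(values ~ names + num, data = df, FUN = max)

# Put both side by side; the columns become values_sum and values_max
heights <- merge(sums, maxes, by = c("names", "num"),
                 suffixes = c("_sum", "_max"))
print(heights)
```

For num = 1 the two columns differ (several rows per names level), while for num = 2 they are identical (one row per names level), which matches what the two plots show.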

It seems that when using position="dodge", ggplot expects only one y per x. I contacted Winston Chang, a ggplot developer, to confirm this and to ask whether it can be changed, as I don't see an advantage to the current behavior.

It seems that stat="identity" should tell ggplot to tally the y values passed inside aes(), in the same way that it tallies individual counts when stat="identity" is omitted and no y value is passed.

For now, the workaround seems to be (for the original df above) to aggregate so there's only one y per x:

df2 <- aggregate(df$values, by=list(df$names, df$num), FUN=sum)
p <- ggplot(df2, aes(x=Group.1, y=x, fill=factor(Group.2)))
p <- p + geom_bar(stat="identity", position="dodge")
p

correct
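The same aggregation can also be sketched with the formula interface of aggregate(), which keeps the original column names instead of producing Group.1, Group.2, and x. This is only an alternative spelling of the workaround above, not a different method. The data frame is rebuilt here so the snippet runs on its own.

```r
library(ggplot2)

# Rebuild the example data from the question
test  <- data.frame(names = rep(c("A", "B", "C"), 5), values = 1:15)
test2 <- data.frame(names = c("A", "B", "C"), values = 5:7)
df <- data.frame(
  names  = c(as.character(test$names), as.character(test2$names)),
  num    = c(rep(1, nrow(test)), rep(2, nrow(test2))),
  values = c(test$values, test2$values)
)

# One row per names/num combination, with values summed;
# the result keeps the column names names, num, and values
df2 <- aggregate(values ~ names + num, data = df, FUN = sum)

# With a single y per x and fill group, dodging places the bars side by side
p <- ggplot(df2, aes(x = names, y = values, fill = factor(num))) +
  geom_bar(stat = "identity", position = "dodge")
p
```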

Hendy

1 Answer

I think the problem is that you want to stack within values of the num group, and dodge between values of num. It might help to look at what happens when you add an outline to the bars.

library(ggplot2)
set.seed(123)
df <- data.frame(
  id     = 1:18,
  names  = rep(LETTERS[1:3], 6),
  num    = c(rep(1, 15), rep(2, 3)),
  values = sample(1:10, 18, replace=TRUE)
)

By default, there are a lot of bars stacked - you just don't see that they're separate unless you have an outline:

# Stacked bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) + 
  geom_bar(stat="identity", colour="black")

Stacked bars

If you dodge, you get bars that are dodged between values of num, but there may be multiple bars within each value of num:

# Dodged on 'num', but some overplotted bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) + 
  geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)

Dodged on num

If you also add id as a grouping var, it'll dodge all of them:

# Dodging with unique 'id' as the grouping var
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(id))) + 
  geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)

Dodge all bars

I think what you want is to both dodge and stack, but you can't do both. So the best thing is to summarize the data yourself.

library(plyr)
df2 <- ddply(df, c("names", "num"), summarise, values = sum(values))

ggplot(df2, aes(x=factor(names), y=values, fill=factor(num))) + 
  geom_bar(stat="identity", colour="black", position="dodge")

Summarized beforehand
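If you would rather not build a second data frame, one option is to let ggplot2 compute the sums with stat_summary() and draw the result as dodged bars. This is a sketch, assuming a ggplot2 version in which stat_summary() takes the fun argument (3.3.0 or later; older versions used fun.y instead). It is still a summary followed by dodging, so the data is summarized before the position adjustment is applied, just inside ggplot2 rather than beforehand.

```r
library(ggplot2)

# Same example data as above
set.seed(123)
df <- data.frame(
  id     = 1:18,
  names  = rep(LETTERS[1:3], 6),
  num    = c(rep(1, 15), rep(2, 3)),
  values = sample(1:10, 18, replace = TRUE)
)

# stat_summary() collapses the y values to one sum per x and fill group,
# then geom = "bar" draws one bar per group and position = "dodge"
# places the groups side by side
p <- ggplot(df, aes(x = factor(names), y = values, fill = factor(num))) +
  stat_summary(fun = sum, geom = "bar", position = "dodge", colour = "black")
p
```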

wch
  • Got it. It was quite helpful for you to point out that I'm actually asking for both dodging and stacking. One quibble: when not using `stat="identity"` (so basically making a histogram), isn't ggplot "stacking" individual counts while dodging between some other characteristic? Even so, I'm okay with the answer that it's just how it works at the moment. I thought I was doing something wrong in my code! – Hendy Jul 23 '12 at 21:48
  • `geom_bar` can be a little confusing because it's used for two different purposes: sometimes it plots y values that you provide, and sometimes it counts up the number of cases in each group and uses that count as the y value (with `stat="bin"`). The latter behavior is the default (you can see it with `ggplot(df, aes(x=factor(names), fill=factor(num))) + geom_bar(colour="black")` ). In this case, the "stacking" is not quite the same -- it's a summary _stat_, whereas the usual stacking is a _position adjustment_. These things happen at different stages of the ggplot pipeline. – wch Jul 24 '12 at 00:59
  • Thanks for the explanation. Aggregating is not a big deal and now I know I need to, which is a big step from just being confused :) – Hendy Jul 24 '12 at 18:25