
I am new to R and ggplot2. I am using the College dataset from the ISLR package. When I make a histogram and fill it with aes(fill=Private), I get the following plot. [image: first histogram] This plot is highly misleading, because if I create a table of Private I get

 No Yes
212 565

but the histogram created by ggplot2 can be read as having more "No" than "Yes". The reference course has the following figure, which correctly depicts the number of "Yes" and "No", but according to its creator it was generated with an older version of ggplot2. Note: I see the difference in the statistics between the two plots, but this does not take away from the objective of this post. [image: reference histogram]

The question is: how do I generate this histogram with the new version of ggplot2 so that it depicts the proper "Yes" and "No", as seen in the second histogram?

I have looked at several other SO posts, such as for stacks, more stacks, for barplots, here, and from ggplot2, but I have not seen an answer for histograms. I have tried using arrange from the dplyr package as well as the R order function, but to no avail.

Here is my R code

library(ISLR)
library(ggplot2)
library(dplyr)
df<-College

ggplot(df,aes(F.Undergrad))+geom_histogram(aes(fill=Private),bins = 50,color='black',alpha=0.5)+theme_bw()
mkunkel
  • Which code produced the first plot? – Tino Jan 10 '18 at 13:53
  • Thanks for the reply, the code that generated the first plot was the code provided at the end of the post. – mkunkel Jan 10 '18 at 13:58
  • See https://stackoverflow.com/questions/6957549/overlaying-histograms-with-ggplot2-in-r – tifu Jan 10 '18 at 14:07
  • Thanks tifu for that post, but I do not want to create 2 different histograms, I want to create one histogram and have the proper number of values for the "Yes" and "No" overlay proportionally to their statistics using aes. – mkunkel Jan 10 '18 at 14:10
  • @mkunkel what you got in the first plot was a proper histogram, but it is stacked. Try `ggplot(df, aes(x = F.Undergrad, fill = Private)) + geom_histogram(position = position_dodge(), color = "black")` if you don't want to stack values. – Tino Jan 10 '18 at 14:13
  • @tino thanks, but the dodge option can also lead to misinterpretation of the x axis and the identity option can also lead to misinterpretation of the y axis, since y is still stacked – mkunkel Jan 10 '18 at 14:24
  • @mkunkel which seems logical to me, since for each bar you see the shares. But if you use "identity", make sure you always combine it with some transparency such as `alpha = .5`, as @brettlausn suggests, because otherwise one could assume that there is no observation in one series if it is overlapped by the other... That's why I like the stacked version, and maybe that's why it is the default. – Tino Jan 10 '18 at 14:40
  • @Tino I guess it depends on which discipline you are more associated with. In particle physics a stacked plot is not ideal as it does not properly show the background of a competing process. – mkunkel Jan 10 '18 at 16:37

2 Answers


If I understood you correctly, all you need to do is reorder the factor levels of df$Private:

df$Private <- relevel(df$Private, "Yes")

ggplot(df, aes(F.Undergrad)) +
  geom_histogram(aes(fill = Private),
                 bins = 50,
                 color = 'black',
                 alpha = 0.5) +
  theme_bw()
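To see what relevel() actually changes, here is a minimal base-R sketch. The `private` vector is a made-up stand-in for df$Private (an assumption for illustration, not the ISLR values): only the order of the levels changes, the counts per group stay the same.

```r
# Hypothetical toy factor standing in for df$Private
private <- factor(c("No", "Yes", "Yes", "No", "Yes"))

# Factor levels are sorted alphabetically by default
levels(private)

# relevel() moves the chosen level to the front
releveled <- relevel(private, "Yes")
levels(releveled)

# The number of observations per level is unchanged
table(private)
table(releveled)
```

Since ggplot2 uses the level order for the fill groups, changing it changes which group ends up at the bottom of each stacked bar.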


The information is essentially the same, because the bars are stacked. If you don't want that, you should follow the advice from @Tino and use position = "dodge":

ggplot(df, aes(F.Undergrad)) +
  geom_histogram(aes(fill = Private),
                 bins = 50,
                 color = 'black',
                 alpha = 0.5,
                 position = "dodge") +
  theme_bw()


With position = "identity":

ggplot(df, aes(F.Undergrad)) +
  geom_histogram(aes(fill = Private),
                 bins = 50,
                 color = 'black',
                 alpha = 0.5,
                 position = "identity") +
  theme_bw()
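To make the difference between "stack" and "identity" concrete, here is a small base-R sketch that computes the bar heights by hand. The `values` and `group` vectors and the bin breaks are invented for illustration and are not taken from the College data.

```r
# Hypothetical toy data: 6 "Yes" and 4 "No" observations
values <- c(1, 2, 2, 3, 3, 3, 1, 1, 2, 3)
group  <- factor(c(rep("Yes", 6), rep("No", 4)), levels = c("Yes", "No"))

# Put the values into three bins and count per group and bin
bins   <- cut(values, breaks = c(0.5, 1.5, 2.5, 3.5),
              labels = c("bin1", "bin2", "bin3"))
counts <- table(group, bins)
counts

# position = "stack": groups sit on top of each other,
# so the top of each bar is the sum over groups
stacked_top <- colSums(counts)

# position = "identity": every group starts at zero,
# so the visible top of each bar is the larger group
identity_top <- apply(counts, 2, max)

stacked_top
identity_top
```

In the stacked case the y axis shows the total per bin and each group's count is the height of its coloured segment. With "identity", each group's count can be read directly off the y axis, but bars overlap, which is why some transparency is needed.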


f.lechleitner
  • Thanks @brettjausn. The first solution was what I was looking for. I do not like the dodge, as it leads to misinterpretation of the x-axis, and I do like position="identity" because of the stacking feature. Your first solution will be accepted. But can I ask why this stacking is the default? Doesn't it lead to more misinterpretations than leaving it off? – mkunkel Jan 10 '18 at 14:20
  • Beware, identity and stack are not the same thing, see my edited answer. I don't know why `"stack"` is the standard position for `geom_histogram()`; I guess it's the most commonly used way of plotting histograms, so they've set it as the default. Also, a quick position reference: http://sape.inf.usi.ch/quick-reference/ggplot2/position – f.lechleitner Jan 10 '18 at 14:31
  • Haha, now I got it. You could also achieve that by simply adding `position = position_stack(reverse = TRUE)` inside `geom_histogram`. – Tino Jan 10 '18 at 14:33
  • @brettljausn Thanks. I meant to say that I do NOT like position="identity". As you can see from the last plot you provided, the y-axis does not show the actual stats of each feature. – mkunkel Jan 10 '18 at 14:37
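
The alternative Tino mentions in the comments can be sketched as follows. This assumes the same ISLR and ggplot2 setup as in the question. As far as I know, position_stack(reverse = TRUE) only flips the order in which the groups are stacked and leaves the factor levels of Private untouched, so treat it as a sketch rather than a drop-in replacement for relevel().

```r
library(ISLR)     # provides the College data set used in the question
library(ggplot2)

df <- College

# Reverse the stacking order instead of releveling the factor
p <- ggplot(df, aes(F.Undergrad)) +
  geom_histogram(aes(fill = Private),
                 bins = 50,
                 color = "black",
                 alpha = 0.5,
                 position = position_stack(reverse = TRUE)) +
  theme_bw()
p
```

The bars are still stacked, so the total height per bin is the same as in the default plot.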

# Set the level order explicitly so "Yes" comes first in the fill
plot1 <- ggplot(df, aes(F.Undergrad, fill = factor(Private, levels = c("Yes", "No")))) +
  geom_histogram(binwidth = 5) +
  scale_fill_manual(values = c("#d7191c", "#fdae61"),
                    name = "Private",
                    labels = c("Yes", "No"))

I used this to order my histograms in levels. Might work on your code.

Irene