I'm hoping to get some help making the following histogram look as clean and understandable as possible. I am plotting the salaries of immigrant versus US-born workers. I am wondering:

1. How would you modify the colors, axis intervals, etc. to make the graph clearer and more appealing?
2. How could I add a key indicating that purple is for US-born workers and pink is for foreign-born workers?
3. How can I add two different lines indicating the median of each group, with a corresponding label for each?

My current code is set up as this:

ggplot(NHIS1, aes(x = adj_SALARY, y = ..density..)) +
  geom_histogram(data = subset(NHIS1, IMMIGRANT == '0'), alpha = .5, binwidth = 800, fill = "purple", position = "identity") +
  geom_vline(xintercept = median(NHIS1$adj_SALARY), col = "black", linetype = "dashed") +
  geom_histogram(data = subset(NHIS1, IMMIGRANT == '1'), alpha = .5, binwidth = 800, fill = "red") +
  geom_vline(xintercept = median(NHIS1$adj_SALARY), col = "black", linetype = "dashed") +
  xlim(4430.4, 50000)

My histogram currently looks like this:

[image: current histogram]

juliah0494

2 Answers

If you have two variables, one for income and one for immigrant status, you do not need to plot two histograms; one will suffice if you specify the grouping. I'd also suggest adding density lines, which smooth over the histogram's bumps:

Assuming this is roughly like your data:

# simulated stand-in for the real data
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

Then a crude way to plot one histogram as well as density lines for the two groups would be this:

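# after_stat(density) is the current form of ..density.. (needs ggplot2 >= 3.3.0)
ggplot(df, aes(x = income, color = born, fill = born)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 100,
                 position = "identity") +
  geom_density(alpha = .2)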
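
The question also asked for a key and for one labelled median line per group. Here is a rough sketch of one way that could be done on the same simulated data. The names `medians` and `label`, the legend title "Birthplace", and the exact colours are my own assumptions, not anything taken from the NHIS1 data:

```r
library(ggplot2)

# Simulated stand-in data, as in the example above (not the real NHIS1 data)
set.seed(1)
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

# One row per group holding that group's median (base R, no extra packages)
medians <- aggregate(income ~ born, data = df, FUN = median)
medians$label <- paste0(medians$born, " median: ", round(medians$income))

p <- ggplot(df, aes(x = income, fill = born)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 100,
                 position = "identity") +
  # dashed vertical line at each group's median, coloured by group
  geom_vline(data = medians, aes(xintercept = income, color = born),
             linetype = "dashed") +
  # text label beside each line; the two vjust values stagger the labels
  geom_text(data = medians,
            aes(x = income, y = Inf, label = label, color = born),
            vjust = c(1.5, 3), hjust = -0.05, show.legend = FALSE) +
  # the same scale name on fill and colour lets ggplot2 merge them into one key
  scale_fill_manual(name = "Birthplace",
                    values = c(US = "purple", Foreign = "pink")) +
  scale_color_manual(name = "Birthplace",
                     values = c(US = "purple", Foreign = "deeppink")) +
  labs(x = "Income", y = "Density") +
  theme_minimal()
p
```

Mapping `fill` to the grouping variable is what produces the key; the manual scales only choose the colours and the title. With real data you would swap in your own column names and probably a wider `binwidth`.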
Chris Ruehlemann

This question has been asked before: overlaying-histograms-with-ggplot2-in-r discusses several options with many examples. You should definitely take a look at it.

Another option for comparing the distributions is a violin plot, using geom_violin(). I find violin plots the better option when you need to compare distributions, because they give you more flexibility and are still easier to read. But that may be just me. Refer to the examples in the manual.
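
As a minimal sketch of the violin idea, assuming data shaped like the simulated `income`/`born` example from the other answer (those column names and the axis labels are assumptions, not the real NHIS1 columns):

```r
library(ggplot2)

# Simulated stand-in data; replace with the real data frame and column names
set.seed(1)
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

v <- ggplot(df, aes(x = born, y = income, fill = born)) +
  # one violin per group showing the shape of its distribution
  geom_violin(alpha = 0.5) +
  # narrow boxplot inside each violin; its middle bar marks the group median
  geom_boxplot(width = 0.1, outlier.shape = NA, show.legend = FALSE) +
  labs(x = "Birthplace", y = "Income", fill = "Birthplace")
v
```

The inner boxplot is optional, but it gives you the per-group median without any extra computation.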

Jan