I have a problem with a density histogram in ggplot2. I am working in RStudio and trying to create a density histogram of income, broken down by each person's occupation. My problem is that when I use this code:

data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        sep=",",header=F,col.names=c("age", "type_employer", "fnlwgt", "education", 
                "education_num","marital", "occupation", "relationship", "race","sex",
                "capital_gain", "capital_loss", "hr_per_week","country", "income"),
        fill=FALSE,strip.white=T)

library(ggplot2)

# `data` is the data frame read in above
ggplot(data=data, aes(x=income)) + 
  geom_histogram(stat='count', 
                 aes(x= income, y=stat(count)/sum(stat(count)), 
                     col=occupation, fill=occupation),
                 position='dodge')

In response I get a histogram in which each count is divided by the overall count of all values across all categories. What I would like instead is, for example, the number of people earning >50K whose occupation is craft-repair divided by the overall number of people whose occupation is craft-repair, the same for <=50K within that occupation, and likewise for every other occupation.

My second question: once the density histogram is correct, how can I sort the bars in decreasing order?

Ra-v
  • Please edit your question to include data and make it reproducible, either by using `dput()` to share your data, or by reproducing your problem with a built-in or simulated data frame. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for tips on including data in R questions. – Jan Boyer Jan 11 '19 at 18:05
  • Reconsider using a histogram. A histogram plots density or frequency, and in this case you are comparing two bins: <=50K and >50K. `geom_bar()` might be a better option here. – OTStats Jan 11 '19 at 18:13

1 Answer

This is a situation where it makes sense to re-aggregate your data first, before plotting. Aggregating within the ggplot call works fine for simple aggregations, but when you need to aggregate and then peel off a group for a second calculation, it doesn't work so well. Also note that because your x axis is discrete, we don't use a histogram here; instead we'll use geom_bar().

First we count the rows for each income/occupation pair, then calculate each count's share of its occupation total by regrouping on occupation.

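library(dplyr)

# Count rows for each income/occupation pair, then divide each count
# by the total for its occupation
d2 <- data %>% group_by(income, occupation) %>% 
  summarize(count = n()) %>% 
  group_by(occupation) %>% 
  mutate(percent = count / sum(count))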
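
To see what the regrouping does, here is a minimal sketch on a tiny made-up data frame (the `toy` data and its values are hypothetical, not taken from the adult data set). It uses `count()` as a shorthand for `group_by()` plus `summarize(n())`, and the shares should sum to 1 within each occupation.

```r
library(dplyr)

# Hypothetical toy data standing in for the adult data set
toy <- data.frame(
  occupation = c("Craft-repair", "Craft-repair", "Craft-repair", "Craft-repair",
                 "Sales", "Sales"),
  income     = c("<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"),
  stringsAsFactors = FALSE
)

toy_share <- toy %>%
  count(income, occupation, name = "count") %>%  # rows per income/occupation pair
  group_by(occupation) %>%                       # denominator: one total per occupation
  mutate(percent = count / sum(count)) %>%
  ungroup()

toy_share
```

Here one of the four Craft-repair rows is ">50K", so its share is 1/4, while each Sales row gets 1/2. Dividing by `sum(count)` without the second `group_by()` would instead give shares of all six rows, which is the behaviour described in the question.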

Then simply plot a bar chart using geom_bar() with position = 'dodge', so the bars sit side by side rather than stacked.

library(ggplot2)

d2 %>% ggplot(aes(income, percent, fill = occupation)) + 
  geom_bar(stat = 'identity', position = 'dodge')


Mako212
  • Amazing, thanks! Could you help me with the second part of the question (how can I sort these bars in descending order, separately for >50K and <=50K)? – Ra-v Jan 15 '19 at 13:12
  • @Ra-v that belongs in a new question, and there are some existing questions that already address sorting dodged bars. – Mako212 Jan 15 '19 at 17:23