Histogram/bar chart containing two variables in bar

Question

I want to make a histogram / bar chart that looks similar to the plot below:

Click here

I have the following code

d1 <- read.table("Session_data_TU2010AND15.csv", header = TRUE, sep = ";")
d <- d1[,c("IncHouseh","HousehNumcars")]

The first variable, IncHouseh, is the household income. It should be shown in intervals on the x-axis, and each bar should show the percentage of households with each number of cars (HousehNumcars) in that interval.

The data d looks like this, however with more than 20000 rows:

      IncHouseh HousehNumcars
1           800             2
2           384             2
4           638             1
5           580             2
6           700             2
7           744             2
8           560             1
9           500             1
10          686             1
11          310             1
12          510             1
13          648             2
14          372             1
15          542             1

As I am new to R, I find it difficult to produce something similar to the linked plot. Thanks for your help!

EDIT: After following massisenergy's code below (big thanks), I've managed to get this figure (which is correct):

Please provide `Session_data_TU2010AND15.csv`, so that others can try on your problem. No one here has any idea how it looks, except you! — massisenergy, Mar 01 '20 at 11:37
Hi, I'm not allowed to upload the entire dataset, but I've provided some of the data points in the post. — nostres, Mar 01 '20 at 11:42
Like this? https://stackoverflow.com/questions/20184096/how-to-plot-multiple-stacked-histograms-together-in-r — R. Schifini, Mar 01 '20 at 11:52
Your graph is not a histogram. It looks more like a stacked relative frequency bar chart. So don't bother with the `hist` function. Look at `barplot` or `geom_bar` from the ggplot2 universe. — Edward, Mar 01 '20 at 12:07

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

You could first use cut to categorize the income data.

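## include.lowest=TRUE keeps an income of exactly 1000 in the first interval
dat$IncHouseh.c <- cut(dat$IncHouseh, seq(1e3, 5e3, 1e3), include.lowest=TRUE,
                       labels=c("10k-20k", "20k-30k", "30k-40k", "40k-50k"))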
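How cut treats values that fall exactly on a break point is easy to overlook. The sketch below uses a few hypothetical income values (not taken from the question's data) to show the default right-closed intervals and the effect of include.lowest.

```r
# Hypothetical income values, chosen to sit on and between the break points
income <- c(1000, 1500, 2000, 2001, 5000)
breaks <- seq(1e3, 5e3, 1e3)

# Default: intervals are (1000,2000], (2000,3000], ... so 1000 itself becomes NA
default_bins <- cut(income, breaks)

# include.lowest = TRUE closes the first interval on the left: [1000,2000]
closed_bins <- cut(income, breaks, include.lowest = TRUE)

print(data.frame(income, default_bins, closed_bins))
```

If the lowest break equals the smallest possible income, leaving out include.lowest silently drops those households from the plot.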

Second, to get the percentage of each number of cars within each income group, you could use prop.table(table(x)) inside a tapply.

## one row per income group, one column per number of cars
agg <- do.call(rbind, with(dat, tapply(HousehNumcars, IncHouseh.c, FUN=function(x)
  prop.table(table(x)))))
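The tapply approach assumes that every income group contains every car count. If one group lacks a level, the per-group tables have different lengths and the rows bound together may no longer line up. A possible alternative, sketched here on hypothetical toy data, is a two-way table normalized by row:

```r
# Hypothetical toy data: the "high" group has no household with 0 cars
bin  <- factor(c("low", "low", "low", "high", "high"), levels = c("low", "high"))
cars <- c(0, 1, 1, 1, 2)

# Two-way table of counts: rows = income bin, columns = number of cars
counts <- table(bin, cars)

# margin = 1 divides each row by its row total, so every row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Passing t(shares) to barplot would then give one stacked bar per income bin, with missing combinations shown as zero.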

Third, plot it!

op <- par(mar=c(5, 5, 4, 6), xpd=TRUE)                   ## expand outer margins
b <- barplot(t(agg), xaxt="n", col=2:5,                  ## assign position output to `b`
             xlab="Income", ylab="Probability", main="Cars in households")
mtext(rownames(agg), 1, 1, at=b)                         ## use `b` for label positioning
legend(5, 1, title="cars", col=5:2, pch=15, legend=3:0)  ## legend
par(op)

Note that agg needs to be transposed, so that each bar is an income group and the stacked segments are the numbers of cars.

Result

Data:

set.seed(42)
dat <- data.frame(IncHouseh=sample(1e3:5e3, 2e3, replace=TRUE),
                  HousehNumcars=sample(0:3, 2e3, replace=TRUE))

massisenergy · Accepted Answer · 2020-03-02T11:27:12.220

Here is another approach, using dplyr for data manipulation and ggplot2 for plotting. The package magrittr provides the pipe operator %>%.

STEP1

Read the data and store it as a data frame named df. Setting stringsAsFactors = FALSE keeps text columns as character rather than factor, which makes data manipulation in the next steps easier.

library(dplyr); library(magrittr); library(ggplot2)
d1 <- read.table(text = "IncHouseh HousehNumcars
1           800             2
2           384             2
4           638             1
5           580             2
6           700             2
7           744             2
8           560             1
9           500             1
10          686             1
11          310             1
12          510             1
13          648             2
14          372             1
15          542             1", header = TRUE)
df <- data.frame(d1, stringsAsFactors = FALSE)

STEP2

Use mutate to add new columns suitable for plotting, and case_when to express the if-else logic that assigns each income to an interval.

df <- df %>% mutate(x_labels = case_when(IncHouseh <= 100 & IncHouseh > 0 ~ "under100",
                                      IncHouseh <= 200 & IncHouseh > 100 ~ "100-200",
                                      IncHouseh <= 300 & IncHouseh > 200 ~ "200-300",
                                      IncHouseh <= 400 & IncHouseh > 300 ~ "300-400",
                                      IncHouseh <= 500 & IncHouseh > 400 ~ "400-500",
                                      IncHouseh <= 600 & IncHouseh > 500 ~ "500-600",
                                      IncHouseh <= 700 & IncHouseh > 600 ~ "600-700",
                                      IncHouseh <= 800 & IncHouseh > 700 ~ "700-800",
                                      IncHouseh <= 900 & IncHouseh > 800 ~ "800-900",
                                      IncHouseh <= 1000 & IncHouseh > 900 ~ "900-1000"))
df <- df %>%
  group_by(x_labels) %>%
  mutate(Probability = HousehNumcars / sum(HousehNumcars) * 100,
         Cars = as.factor(HousehNumcars))
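The chain of case_when conditions can also be written more compactly with base R's cut. This is only a sketch of an alternative, not part of the original answer. The breaks and label names are chosen to reproduce the intervals above, and the two edge-case incomes (100 and 101) are hypothetical.

```r
# Income values from the sample data in the question, plus two hypothetical edge cases
income <- c(800, 384, 638, 100, 101)

breaks <- seq(0, 1000, by = 100)
# Build "under100", "100-200", ..., "900-1000" to match the labels used above
bin_labels <- c("under100", paste(breaks[2:10], breaks[3:11], sep = "-"))

# cut() uses right-closed intervals by default: (0,100], (100,200], ...
x_labels <- cut(income, breaks = breaks, labels = bin_labels)
print(data.frame(income, x_labels))
```

Because cut returns a factor whose levels follow the order of the breaks, ggplot2 would draw the bars in income order. Character labels are sorted alphabetically instead, which would place "under100" last.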

STEP3

Plotting!

plot <- df %>% ggplot(aes(x = x_labels, y = Probability, fill = Cars)) + geom_col()
# optional cosmetic tweaks: axis labels and a centered title
plot + ylab("Probability or number of cars (%)") + xlab("Range of income") +
  ggtitle("Number of cars according to household income") +
  theme(plot.title = element_text(hjust = 0.5))
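The Probability column above is each row's share of the cars in its income group. If the goal is instead the share of households with a given number of cars, counting rows is one possible alternative. The sketch below assumes that reading of the question and uses hypothetical toy data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the real survey
toy <- data.frame(
  x_labels = c("300-400", "300-400", "300-400", "500-600", "500-600"),
  Cars     = factor(c(1, 1, 2, 1, 2))
)

# geom_bar() counts rows; position = "fill" rescales each bar to sum to 1
p <- ggplot(toy, aes(x = x_labels, fill = Cars)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Range of income", y = "Share of households")
print(p)
```

With this variant no percentage has to be computed by hand, since ggplot2 derives it from the row counts in each income group.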

EDIT

Goal: group the fill variable Cars in a custom way. The idea is the same: use case_when.

df_cars <- df %>%
  group_by(x_labels) %>%
  mutate(Probability = HousehNumcars / sum(HousehNumcars) * 100,
         # values matching no condition (e.g. 0 cars) become NA
         Cars = case_when(HousehNumcars == 1 ~ "1",
                          HousehNumcars >= 2 ~ "2+"))

# Plotting in the same way:
plot <- df_cars %>% ggplot(aes(x = x_labels, y = Probability, fill = Cars)) + geom_col()
plot + ylab("Probability or number of cars (%)") + xlab("Range of income") +
  ggtitle("Number of cars according to household income") +
  theme(plot.title = element_text(hjust = 0.5))
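The same idea extends to pooling rare counts, for instance a single group for six or more cars, and to dropping rows with missing values first. A minimal sketch on hypothetical data (the threshold of 6 and the toy values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical car counts, including a missing value and some large counts
toy <- data.frame(HousehNumcars = c(0, 1, 2, 5, 6, 9, 12, NA))

grouped <- toy %>%
  filter(!is.na(HousehNumcars)) %>%            # drop rows with missing car counts
  mutate(Cars = case_when(
    HousehNumcars >= 6 ~ "6+",                 # pool the rare large counts
    TRUE ~ as.character(HousehNumcars)         # keep 0 to 5 as they are
  )) %>%
  mutate(Cars = factor(Cars, levels = c(as.character(0:5), "6+")))  # fix legend order

print(table(grouped$Cars))
```

Setting the factor levels explicitly keeps "6+" at the end of the legend rather than leaving the order to alphabetical sorting.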

Hi, this helped a lot, so thank you for the code. However, something seems to go wrong with my car grouping because i have missing values (NA) in some of the rows. And also the maximum number of cars is 12, but for some reason 20 is also shown as a group. I've included the figure in the post. Thanks a lot! — nostres, Mar 01 '20 at 17:50
Hmm, it would help if I get to see some more of the data (& more importantly **more representative**). Can you take about `~30` rows from your data & post it? 15 is too less for this purpose I guess & also it only contains 1/2 cars. Need something that spans in the whole spectrum. Maybe it's because of this `NA` problem. Did you use `dput()` to take the example data that you gave here? If not, please use that. Please check this: [Example of using `dput()`](https://stackoverflow.com/questions/49994249/example-of-using-dput) — massisenergy, Mar 01 '20 at 18:51
The values pretty much look the same, the only difference is that there are some missing values. But it was quite easily fixable by using the na.omit function. Btw I've updated my post above with the correct figure. The only thing I'd like to change about it is that I can get a group that covers 6+ cars instead of 6,7,8,9,10 as they are not that representative individually — nostres, Mar 01 '20 at 22:30
Okay, glad to help. It's easy, using the `case_when`, in the same way. I just showed it for two cars, as I don't have more in the dataset. — massisenergy, Mar 02 '20 at 11:28
Also, can you please upvote and accept the answer to mark that the problem is solved or it's helpful? — massisenergy, Mar 02 '20 at 11:30
