0

I am using the R programming language. I created the following data set for this example:

var_1 <- rnorm(1000,10,10)
var_2 <- rnorm(1000, 5, 5)
var_3 <- rnorm(1000, 6,18)

favorite_food <- c("pizza","ice cream", "sushi", "carrots", "onions", "broccoli", "spinach", "artichoke", "lima beans", "asparagus", "eggplant", "lettuce", "cucumbers")
favorite_food <-  sample(favorite_food, 1000, replace=TRUE, prob=c(0.5, 0.45, 0.04, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001))


response <- c("a","b")
response <- sample(response, 1000, replace=TRUE, prob=c(0.3, 0.7))


data = data.frame( var_1, var_2, var_3, favorite_food, response)

data$favorite_food = as.factor(data$favorite_food)
data$response = as.factor(data$response)

From here, I want to make histograms for the two categorical variables in this data set and put them on the same page:

#make histograms and put them on the same page (note: I don't know why the "par(mfrow = c(1,2))" statement is not working)
par(mfrow = c(1,2))

histogram(data$response, main = "response"))

histogram(data$favorite_food, main = "favorite food"))


My question: Is it possible to automatically produce histograms for all categorical variables in a given data set (without manually writing the "histogram()" statement for each variable) and print them on the same page? Would it be better to use the "ggplot2" library for this problem?

I can manually write the "histogram()" statement for each individual categorical variable in the data set, but I am looking for a quicker way to do this. Is it possible to do this with a "for loop"?

Thanks

stats_noob
  • NB this code isn't quite reproducible as you haven't said which package you are using to call `histogram` and there appear to be some superfluous brackets at the end of those calls – Scransom Apr 29 '21 at 04:33
  • [As a subsequent question suggests, I think the OP was using `hist` but wrote `histogram` here.] – Jon Spring Apr 30 '21 at 02:58

4 Answers

4

A ggplot2/tidyverse solution is to reshape the data into long format, so that each selected column becomes a category, and then use faceting to plot them all on the same page:

(with edit to plot only factor variables)

factor_vars <- sapply(data, is.factor)

varnames <- names(data)

deselect_not_factors <- varnames[!factor_vars]

library(tidyr)
library(ggplot2)

data_long <- data %>%
  pivot_longer(
    cols = -all_of(deselect_not_factors), # all_of() avoids the ambiguous external-vector selection
    names_to = "category",
    values_to = "value"
  )

ggplot(data_long) +
  geom_bar(
    aes(x = value)
  ) +
  facet_wrap(~category, scales = "free")


Scransom
  • Thank you! But suppose the variables all have different names - is there a way to only select categorical variables (i.e. factor)? – stats_noob Apr 29 '21 at 04:24
  • I've edited to do this by working out which columns are factors and then excluding the rest in the `cols` argument to `pivot_longer` – Scransom Apr 29 '21 at 04:31
  • Thank you so much! Just a question: is it possible to replace the "pivot_longer" statement using functions from the "dplyr" and "reshape2" libraries? Thanks again for all your help! – stats_noob Apr 29 '21 at 17:13
  • You might be able to get the same result as `pivot_longer` using some stuffing around with `melt` and `cast` from `reshape2` but there's no real reason to still be using that package. `tidyverse` (including `dplyr` and `tidyr`) are the modern and more user friendly evolution from `reshape2`. Message: just install `tidyr` – Scransom Apr 30 '21 at 04:45
3

Here's a base R alternative using barplot in a for loop:

cols <- names(data)[sapply(data, is.factor)]


# This would need some manual adjustment if the number of columns increases
par(mfrow = c(1,length(cols))) 

for(i in cols) {
  barplot(table(data[[i]]), main = i)
}
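
If the number of factor columns grows, the layout can be picked automatically rather than by hand. The following is only a sketch under stated assumptions: `demo_data` is a hypothetical stand-in for `data`, the layout comes from `grDevices::n2mfrow()`, and the `mar` values are an arbitrary choice that may need tuning for your device size (smaller margins can help when the plotting region gets cramped).

```r
# Hypothetical stand-in for the `data` object from the question
set.seed(1)
demo_data <- data.frame(
  x  = rnorm(50),
  f1 = factor(sample(c("a", "b"), 50, replace = TRUE)),
  f2 = factor(sample(c("pizza", "sushi", "carrots"), 50, replace = TRUE)),
  f3 = factor(sample(c("yes", "no"), 50, replace = TRUE))
)

# Names of all factor columns
factor_cols <- names(demo_data)[sapply(demo_data, is.factor)]

# n2mfrow() suggests a (rows, columns) layout for a given number of plots
layout_dims <- grDevices::n2mfrow(length(factor_cols))

# Set the layout and smaller margins (assumed values), keeping the old settings
old_par <- par(mfrow = layout_dims, mar = c(4, 3, 2, 1))
for (col in factor_cols) {
  # las = 2 rotates axis labels so long level names are less likely to overlap
  barplot(table(demo_data[[col]]), main = col, las = 2)
}
# Restore the previous graphics settings
par(old_par)
```

Restoring `par()` at the end keeps later plots from inheriting the multi-panel layout.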


Ronak Shah
  • is there a way to avoid this error - Error in plot.new() : figure margins are too large? – stats_noob Apr 29 '21 at 16:56
  • Have you tried suggestions from this post? https://stackoverflow.com/questions/23050928/error-in-plot-new-figure-margins-too-large-scatter-plot – Ronak Shah Apr 29 '21 at 23:34
2

As an alternative, you can capitalize on the fantastic DataExplorer package.

Note that histograms are for continuous variables, so what you actually want for your categorical variables are bar plots. This can be done as follows:

if(require(DataExplorer)==FALSE) install.packages("DataExplorer"); library(DataExplorer)
DataExplorer::plot_histogram(data) # plots histograms for continuous variables
DataExplorer::plot_bar(data) # bar plots for categorical variables

Please refer to the package manual for more details.

1

Here is a try using cowplot & ggplot2

library(ggplot2)
library(dplyr)
library(foreach)
library(cowplot)

list_variables <- c("response", "favorite_food")
all_plot <- foreach(current_var = c(list_variables)) %do% {
  # need to do this to avoid ggplot reference to same summary data afterward.
  data_summary_name <- paste0(current_var, "_summary")
  eval(substitute(
    {
      graph_data <- data %>%
        group_by(!!sym(current_var)) %>%
        summarize(count = n(), .groups = "drop") %>%
        mutate(share = count / sum(count))
      plot <- ggplot(graph_data) +
        geom_bar(mapping = aes(x = !!sym(current_var), y = share), width = 1,
          fill = "#00FFFF", color = "#000000", stat = "identity") +
        scale_y_continuous(labels = scales::percent) +
        ggtitle(current_var) + ylab("Percent of Total") +
        theme_bw()
    }, list(graph_data = as.name(data_summary_name))
  )) 
  return(plot)
}

plot_grid(plotlist = all_plot, ncol = 2)

Note: For the reason I use eval & substitute, see the question "ggplot2 generate same plot for different variables in a for loop".

An approach using facet_wrap, similar to QuishSwash's answer, but with the data calculated as shares instead of counts:
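
As a possibly simpler variant, the share table can be built per variable inside `lapply()`, so each plot carries its own small data frame and no eval/substitute is needed. This is a sketch, not the method used above: `demo_data` is a hypothetical stand-in for `data`, the factor columns are detected automatically with `is.factor`, and the `Var1` column name is what `as.data.frame()` assigns to an unnamed one-way table.

```r
library(ggplot2)

# Hypothetical stand-in for the `data` object from the question
set.seed(1)
demo_data <- data.frame(
  var_1 = rnorm(100),
  response = factor(sample(c("a", "b"), 100, replace = TRUE)),
  favorite_food = factor(sample(c("pizza", "sushi"), 100, replace = TRUE))
)

# Pick up every factor column instead of listing the names by hand
factor_cols <- names(demo_data)[sapply(demo_data, is.factor)]

all_plots <- lapply(factor_cols, function(v) {
  # One row per level, with that level's share of all rows
  share_df <- as.data.frame(prop.table(table(demo_data[[v]])),
                            responseName = "share")
  ggplot(share_df, aes(x = Var1, y = share)) +
    geom_col(fill = "#00FFFF", color = "#000000") +
    scale_y_continuous(labels = scales::percent) +
    labs(title = v, x = v, y = "Percent of Total") +
    theme_bw()
})

# Arrange the plots on one page
cowplot::plot_grid(plotlist = all_plots, ncol = 2)
```

Because each call to the anonymous function builds its own `share_df`, the plots do not end up pointing at the same summary data.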

list_variables <- c("response", "favorite_food")
# Calculate the share for the chosen variables defined in list_variables
# You can adjust this by selecting variables based on some condition
summary_df <- bind_rows(foreach(current_var = c(list_variables)) %do% {
  data %>%
    group_by(variable = !!sym(current_var)) %>%
    summarize(count = n(), .groups = "drop") %>%
    mutate(share = count / sum(count),
      variable_name = current_var)
})

ggplot(summary_df) +
  geom_bar(
    aes(x = variable, y = share),
    fill = "#00FFFF", color = "#000000", stat = "identity") +
  facet_wrap(~variable_name, scales = "free") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()

Created on 2021-04-29 by the reprex package (v2.0.0)

Sinh Nguyen
  • thank you! Is there a way to replace the list_variables object, and automatically put all the variable names into list_variables, without individually writing them? – stats_noob Apr 29 '21 at 17:16
  • @stats555 there are many ways to do that, and they are already covered in the other answers. It really depends on your dataset. In this case, if you only need factors, then `cols <- names(data)[sapply(data, is.factor)]` from @RonakShah's answer is concise and beautiful – Sinh Nguyen Apr 29 '21 at 21:00