I have a technical question about this example dataset (I am using RStudio).

I wrote a function for descriptive visualisation. It still needs some work, but for now it looks like this, using boxplots as an example:

library(ggplot2)
library(dplyr)

data("Salaries", package = "carData")

f <- function(x) {
  lapply(X = Salaries %>% select_if(is.numeric), FUN = function(X) {
    ggplot(Salaries, aes(x, y = X, fill = x, color = x)) +
      geom_boxplot(col = "black")
  })
}

lapply(Salaries %>% select_if(is.factor), FUN = function(X) f(X))

So now I am able to visualise boxplots for every combination of categorical and continuous variables.

However, I cannot find a way to give each boxplot a different fill colour. I would like to know how to apply fill colours both automatically and manually.

Thanks.

stefan
Anas116

2 Answers

I am surprised by the problem you describe with the boxplot colors: when I run your code, the boxplots I obtain already have different fill colors.

However, one problem with your code is that the plots do not show what is being plotted: the axis labels read x and X in every plot. This comes from how lapply() is used: when it iterates over the columns themselves, the function receives only each column's values, not the name of the variable being analyzed.
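A minimal sketch of that limitation, using a toy list (`cols`, hypothetical data standing in for the factor columns) rather than the Salaries data:

```r
# Toy list standing in for the factor columns (hypothetical data)
cols <- list(rank = c("Prof", "AsstProf"), sex = c("Male", "Female"))

# Iterating over the values: the function only ever sees the vector itself,
# so it has no way to know it is working on "rank" or "sex"
by_value <- lapply(cols, function(v) length(v))

# Iterating over the names: the function receives the name and can
# look up the values with [[ ]]
by_name <- lapply(names(cols), function(nm) {
  paste(nm, "has", length(cols[[nm]]), "values")
})
```

Note that the second result is an unnamed list, because `names(cols)` is itself an unnamed character vector.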

I then suggest the following improvement to your code, so that each plot shows the name of the analyzed variables on the axis labels. The solution was inspired by the first comment by Akrun on this post, precisely about the problem with lapply(), where the user suggests using names(obj) instead of obj as the argument of lapply().

library(ggplot2)
library(dplyr)

data("Salaries", package = "carData")

f <- function(df, xname) {
  x = df[[xname]]
  toplot = df %>% select_if(is.numeric)
  lapply(
      names(toplot), FUN = function(yname) {
        y = toplot[[yname]]
        print(ggplot(mapping=aes(x, y, fill = x)) +
          geom_boxplot(col = "black") + xlab(xname) + ylab(yname))
      }
    )
}

Salaries_factors = Salaries %>% select_if(is.factor)
invisible(lapply(names(Salaries_factors), FUN = function(factor_name) f(Salaries, factor_name)))

In summary, the main change with respect to your code is to iterate over the variable names instead of the columns themselves, i.e. to replace lapply(Salaries %>% ...) with lapply(names(Salaries_factors), ...) on the last line.

When we run this code, we get the boxplot shown at the end (containing the distribution of the salary variable in terms of the sex factor), where both the horizontal and the vertical labels are informative of the variables being plotted.

Note the following additional side changes I made to your original code:

  1. I made the function applicable to other datasets by adding the data frame containing the data as its first parameter.
  2. I wrapped the lapply() call in invisible() to suppress the (possibly unwanted) console output of the list that lapply() returns, which has one entry per group analyzed(*). In turn, this required enclosing the ggplot() call in print()... otherwise, no plots are generated.

(*) As a caveat, should the automatic printing of lapply() be of interest, this solution would NOT show informative values of the groups if the invisible() call is removed. The information one sees in that case is simply [[1]], [[2]], etc., instead of $rank, $sex, etc.
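If named output is wanted, two standard base R idioms restore the names. This is only a sketch on a toy list (`cols` and `describe()` are hypothetical names, not part of the code above):

```r
# Toy list standing in for the factor columns (hypothetical data)
cols <- list(rank = c("Prof", "AsstProf"), sex = c("Male", "Female"))
describe <- function(nm) paste(nm, "has", length(cols[[nm]]), "values")

# lapply() over a plain character vector returns an unnamed list: [[1]], [[2]]
unnamed <- lapply(names(cols), describe)

# Option 1: attach the names afterwards
named_a <- setNames(lapply(names(cols), describe), names(cols))

# Option 2: sapply() uses a character input as names; simplify = FALSE keeps a list
named_b <- sapply(names(cols), describe, simplify = FALSE)
```

Either result prints with $rank and $sex headers instead of [[1]] and [[2]].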

[Image: boxplot of salary by sex]

mastropi
  • Wow, thanks a lot for working out this other problem. Regarding the colours, my bad: I didn't make my question clear enough. What I meant is that each group of boxplots created by this iterative command should use colours different from those of the other groups. For example: the first group is male and female, where male is blue and female is pink; the second group is discipline A and B, where A is green and B is purple; and so on. – Anas116 Aug 17 '22 at 10:09
  • Ah, ok... Then you should take a look at the `scale_fill_manual()` function in ggplot2 to define the colours in function `f()` when analyzing each group. Let me know if you struggle with using it and I can try to help you out. – mastropi Aug 17 '22 at 14:00
  • Actually I tried it before, but it's quite tiring with so many variables, let alone their categories. I want an iterative way to make sure that every group gets a new set of colours, without setting it up myself for each group. – Anas116 Aug 17 '22 at 15:30
  • I see. Based on your difficulties, I just posted a new answer below (https://stackoverflow.com/a/73402001/6118609) that should satisfy your needs. The solution is general for any number of factor variables, taking any number of different values (categories). – mastropi Aug 18 '22 at 11:10

Based on the OP's comments to my first answer, stating what they are really after, I now give a solution that integrates my previous answer with the OP's wishes.

Thus, this solution:

  • shows the variable labels in each plot (as done already by the solution in my first answer) (not requested but good to have)
  • uses a different color set for the boxplots in each analyzed factor (requested)

The solution is based on:

  1. Gathering relevant information about the factor variables, namely: how many there are, how many categories per factor variable, how many categories in total.
  2. Storing related information as part of the names of the factor variables in the data frame of factor variables (Salaries_factors).
  3. Defining a color palette with as many colors as the total number of categories across all factor variables.
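As a worked example of the index arithmetic behind these steps, here is a small sketch. It assumes the level counts of the three Salaries factors (rank: 3, discipline: 2, sex: 2) and uses illustrative variable names that differ from the full code:

```r
# Number of categories per factor (assumed counts for the Salaries data)
n_by_factor <- c(rank = 3, discipline = 2, sex = 2)

# Start position of each factor's color set: 1, then 1 + running total
# of the categories of the preceding factors
start <- c(1, 1 + head(cumsum(n_by_factor), -1))
names(start) <- names(n_by_factor)

# One palette with as many colors as categories in total
palette <- terrain.colors(sum(n_by_factor))

# Consecutive, non-overlapping slices of the palette, one per factor
slices <- Map(function(s, n) palette[s:(s + n - 1)], start, n_by_factor)
```

Here `start` is 1, 4, 6, so rank takes colors 1 to 3, discipline 4 to 5, and sex 6 to 7.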

The implementation of f() leverages this information and does the rest.


library(ggplot2)
library(dplyr)

f <- function(df, x_idx_name_depth, colors_palette) {
  # Get the relevant information about the x variable to plot
  # which will allow us to define the colors to use for the boxplots
  x_info = unlist( strsplit(x_idx_name_depth, ",") )
  idx_color_start = as.numeric(x_info[1])  # start position for the color set in the palette
  xname = x_info[2]
  n_colors = as.numeric(x_info[3])  # How many values the x variable takes
  
  # Get the values of the x variable
  x = df[[xname]]
  
  # Define the color set to use for the boxplots
  colors2use = setNames(colors_palette[idx_color_start:(idx_color_start+n_colors-1)],
                        names(table(x)))

  # Define all the continuous variables to visualize (one at a time)
  # with boxplots against the x variable
  toplot = df %>% select_if(is.numeric)
  lapply(
    names(toplot), FUN = function(yname) {
      y = toplot[[yname]]
      print(ggplot(mapping=aes(x, y, fill=x)) +
              geom_boxplot(color = "black") + xlab(xname) + ylab(yname) +
              scale_fill_manual(values=colors2use, aesthetics="fill"))
    }
  )
}

# Data for analysis
data("Salaries", package = "carData")

# Data containing the factor variables used to group the boxplots
Salaries_factors = Salaries %>% select_if(is.factor)

# Characteristics of the factor variables which will help us
# define the color set in each boxplot group 
factor_names = names(Salaries_factors)
n_factors = length(factor_names)
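# nlevels() matches names(table(x)) used in f(), even with NAs or unused levels
n_categories_by_factor = unlist(lapply(Salaries_factors, FUN=nlevels))
n_categories = sum(n_categories_by_factor)
# head(..., -1) drops the last cumulative count and also works with a single factor
color_start_index_by_factor = setNames( c(1, 1+head(cumsum(n_categories_by_factor), -1)),
                                        factor_names )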

# Set smart names to the factor variables so that we can infer the information needed to
# define different (non-overlapping) color sets for the different boxplot groups.
# These names allow us to infer:
# - the order in which the factor variables are analyzed by the lapply() call
#   --> this defines each color set.
# - the number of different values each factor variable takes (categories)
#   --> this defines each color within each color set
# Ex: "4,discipline,2"
names(Salaries_factors) = paste(color_start_index_by_factor,
                                names(Salaries_factors),
                                n_categories_by_factor,
                                sep=",")

# Define the colors palette to use
colors_palette = terrain.colors(n=n_categories)

# Run the process
invisible(lapply(names(Salaries_factors),
                 FUN = function(factor_idx_name_depth)
                          f(Salaries, factor_idx_name_depth, colors_palette)))

Here are the generated boxplots for the salary variable against each of the three factor variables:

[Images: salary by rank, salary by discipline, salary by sex]
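A possible variation, not part of the answer above, keeps the color sets in a named list instead of encoding them in the column names. The sketch below rests on several assumptions: the helper `palettes_by_factor()` is hypothetical, `hcl.colors()` (R >= 3.6) replaces `terrain.colors()`, and the built-in `warpbreaks` data stands in for Salaries so that the block runs without extra packages.

```r
# Hypothetical helper: returns one named color vector per factor column of `df`,
# cut from a single palette so that no color is shared between factors
palettes_by_factor <- function(df, palette_fun = hcl.colors) {
  factors  <- Filter(is.factor, df)                 # keep factor columns only
  n_levels <- vapply(factors, nlevels, integer(1))  # categories per factor
  colors   <- palette_fun(sum(n_levels))            # one color per category overall
  # Label each color with the factor it belongs to, preserving column order
  groups   <- factor(rep(names(factors), n_levels), levels = names(factors))
  # Split the palette by factor and name each color after its level
  Map(setNames, split(colors, groups), lapply(factors, levels))
}

# Automatic colors for every factor in the stand-in dataset
pal <- palettes_by_factor(warpbreaks)

# Manual override for a single factor, leaving the others automatic
pal_manual <- modifyList(pal, list(wool = c(A = "steelblue", B = "pink")))
```

Inside f(), `scale_fill_manual(values = pal[[xname]])` could then pick the right set by name, and replacing one list entry by hand gives manual control for that factor only.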

mastropi