How to set a legend/key and colors independent of or directly linked to certain values in ggplot2?

Question

So, I have a series of plots that I would like to make, based on a survey. Each plot depends on a corresponding column of a data frame, and each column is filled with a different range of numbers, from 1 to x, where x depends on the question the plot relates to (some questions are answered from 1 to 5, some from 1 to 7, and so on).

I would like to have a fixed legend/key for those questions/plots that have the same answer options, e.g. c("Strongly disagree", "Disagree", "Somewhat disagree", "Neither agree or disagree", "Somewhat agree", "Agree", "Strongly agree"). The first option, "Strongly disagree", is a "1" in the data, "Disagree" is a "2", and so on.

To make them easily comparable they should have the same legend/key with the same options and colours.

My problem is that there are a number of occasions where one or more of the answer options of a question was not chosen by any respondent. My current code looks something like this:

education_plot <- ggplot(Data) +
    aes(Cluster, fill = as.character(Education)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_manual(name = "Level of education", labels = c("No schooling completed", "Some high school, no diploma", "High school graduate, diploma or the equivalent", "College graduate", "Trade/technical/vocational training", "Bachelors degree", "Masters degree", "Doctorate degree"))

I have a number of these code blocks, one for each graph. Each graph should display how often each option was chosen (scaled to 100%) in each respondent cluster.

Example: if no respondent chose "No schooling completed" ("1"), the legend/key still uses this term and assigns it a colour, but it displays the answers for "Some high school, no diploma" ("2") in the colour of "No schooling completed". The legend/key therefore attaches the wrong names to the values and does not show all of the answer options: it cuts off the last n answer options, where n is the number of answer options that nobody chose.

Image of an example graph

Here the last answer option, "Doctorate degree", is cut off, although it is actually the first option, "No schooling completed", that nobody chose. That first option is nevertheless shown and coloured with the "wrong" data, when it should have 0/no bar.

Can someone help me set a legend/key that is always printed in full and shows the correct values, including 0 for options not chosen by any respondent?

edit: my test code looks like this:

color_mapping <- setNames(hue_pal()(8), 8)

education_plot <- ggplot(Data) +
    aes(Cluster, fill = as.character(Education)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    scale_fill_manual(name = "Level of education", values = color_mapping, drop = FALSE, labels = c("No schooling completed", "Some high school, no diploma", "High school graduate, diploma or the equivalent", "College graduate", "Trade/technical/vocational training", "Bachelors degree", "Masters degree", "Doctorate degree"))

resulting graph

The problem is that the last label ("Doctorate degree") is still not represented in the legend, and the data is coloured/connected wrongly, since in this example no respondent answered with "No schooling completed". My code simply doesn't know how to match the right value (1-8 in this example) to the right category (label), so it finds 7 different values (2-8) and assigns them to the first 7 labels I defined. How do I tell my code how to match them? And shouldn't the legend at least show "Doctorate degree", since I set drop = FALSE?

Dataset produced by dput():

! structure(list(Education = c(7, 4, 7, 7, 8, 6, 6, 8, 8, 6, 4, 5, 6, 7, 6, 8, 4, 4, 8, 7, 7, 3, 5, 7, 4, 4, 7, 7, 7, 5, 7, 3, 7, 8, 6, 8, 5, 7, 5, 6, 4, 6, 3, 6, 7, 7, 6, 4, 2, 7, 3, 6, 4, 4, 6, 6, 4, 4, 8, 7, 4, 4, 8, 6, 5, 7, 7, 7, 7, 4, 6, 4, 8, 8, 7, 8, 8, 6, 7, 4, 6, 6, 6, 5, 6, 7, 7, 4, 7, 6, 7, 7, 7, 4, 6, 7, 6, 3, 7, 7, 7, 6, 6, 4, 6, 4, 6, 4, 8, 7, 4, 5, 4, 6, 4, 7, 6, 6, 4, 7, 6, 6, 8, 7, 8, 5, 7, 7, 8, 7, 6, 6, 6, 4, 8, 7, 8, 6, 6, 4, 7, 6, 6, 6, 3, 7, 7, 4, 8, 8, 7, 8, 7, 4, 6, 4, 8, 6, 7, 7, 3, 7, 5, 8, 6, 3, 7, 7, 8, 4, 8, 6, 7, 7, 6, 6, 3, 6, 6, 8, 6, 6, 2, 4, 7, 6, 8, 8, 6, 3, 4, 8, 7, 6, 5, 7, 7, 8, 7, 3, 6, 4, 4, 4, 7, 4, 8, 7, 7, 6, 6, 6, 6, 6, 3, 4, 7, 6, 6, 6, 6, 6, 4, 6, 7, 7, 3, 6, 7, 6, 6, 6, 4, 7, 6, 6, 6, 7, 7, 4, 6, 3, 6, 6, 6, 6, 7, 6, 6, 4, 4, 6, 6, 4, 4, 4, 6, 4, 6, 6, 6, 6, 6, 6, 4, 6, 4, 4, 6, 6, 6, 8, 6, 6), Cluster = c(4L, 4L, 2L, 2L, 2L, 2L, 4L, 3L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 4L, 2L, 2L, 4L, 4L, 2L, 3L, 3L, 2L, 3L, 2L, 1L, 4L, 2L, 4L, 4L, 1L, 2L, 2L, 4L, 2L, 1L, 2L, 4L, 2L, 1L, 2L, 4L, 3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 4L, 2L, 2L, 4L, 2L, 2L, 2L, 4L, 4L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 4L, 1L, 2L, 4L, 4L, 2L, 3L, 2L, 2L, 2L, 4L, 4L, 1L, 2L, 4L, 4L, 4L, 2L, 4L, 2L, 2L, 2L, 2L, 1L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 4L, 2L, 3L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 3L, 4L, 2L, 2L, 4L, 4L, 1L, 2L, 2L, 4L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 4L, 4L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 1L, 2L, 2L, 4L, 2L, 4L, 3L, 4L, 2L, 3L, 1L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 2L, 1L, 3L, 2L, 1L, 2L, 2L, 3L, 2L, 3L, 1L, 4L, 3L, 4L, 3L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 2L, 3L, 4L, 3L, 4L, 4L, 2L, 4L, 4L, 2L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 2L, 4L, 3L, 1L, 4L, 4L, 2L, 2L, 4L, 2L, 2L, 4L, 3L, 2L, 2L, 1L)), row.names = c(NA, -274L), class = 
"data.frame")

Update in response to Update 2 from Dan Adams. My code:

education_plot <- ggplot(test1) +
aes(Cluster, fill = as.character(Education)) + 
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(name = "Level of education", values = color_mapping,  drop = F)

education_plot

His code:

data1 %>% 
  ggplot(aes(x = Cluster)) +
  geom_bar(aes(fill = Education), stat = "count", position = "fill") +
  scale_fill_manual(values = color_mapping, drop = F) +
  scale_y_continuous(labels = percent)

This gives the result I wanted.

it looks like you're mixing a few different examples here. Can you please share some of your actual data and one clean attempt at plotting it so we can help. — Dan Adams, Mar 16 '21 at 15:36
I tried to unify my examples. How do I share the relevant dataset? — Sese, Mar 16 '21 at 15:52
Also check out: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Dan Adams, Mar 16 '21 at 16:04

Dan Adams · Accepted Answer · 2021-03-17T12:21:31.553

Update 3

You are using as.character(Education), which produces a character vector without levels, so the scale can never retain values that are not present in the actual data. That can only be accomplished with a factor. You also need a factor to enforce the order; otherwise the categories sort alphabetically.
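To illustrate the difference, here is a minimal base-R sketch on made-up toy values (the vector answers and its 1-4 coding are assumptions for illustration only, not the survey data). It shows that a character vector only knows the values that occur, while a factor keeps every declared level, including those with a count of zero:

```r
# Toy data (made up): answers coded 1-4, but nobody chose 1
answers <- c(2, 3, 3, 4, 2)

# A character vector has no levels; only the observed values exist
as_chr <- as.character(answers)
sort(unique(as_chr))
#> [1] "2" "3" "4"

# A factor carries the full set of levels, in the order given
as_fct <- factor(answers, levels = 1:4)
levels(as_fct)
#> [1] "1" "2" "3" "4"

# The unused level is still there, with a count of zero
table(as_fct)[["1"]]
#> [1] 0
```

Because the empty level survives in the factor, drop = FALSE in the fill scale has something to retain.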

Update 2

I'll leave my original answer below in case it's helpful to others down the line. However, with the data you shared, I think it's easiest to use fct_recode() to modify the original data with the labels you want.

# load libraries
library(tidyverse)
library(scales)

# import data
data <- structure(list(Education = c(7, 4, 7, 7, 8, 6, 6, 8, 8, 6, 4, 5, 6, 7, 6, 8, 4, 4, 8, 7, 7, 3, 5, 7, 4, 4, 7, 7, 7, 5, 7, 3, 7, 8, 6, 8, 5, 7, 5, 6, 4, 6, 3, 6, 7, 7, 6, 4, 2, 7, 3, 6, 4, 4, 6, 6, 4, 4, 8, 7, 4, 4, 8, 6, 5, 7, 7, 7, 7, 4, 6, 4, 8, 8, 7, 8, 8, 6, 7, 4, 6, 6, 6, 5, 6, 7, 7, 4, 7, 6, 7, 7, 7, 4, 6, 7, 6, 3, 7, 7, 7, 6, 6, 4, 6, 4, 6, 4, 8, 7, 4, 5, 4, 6, 4, 7, 6, 6, 4, 7, 6, 6, 8, 7, 8, 5, 7, 7, 8, 7, 6, 6, 6, 4, 8, 7, 8, 6, 6, 4, 7, 6, 6, 6, 3, 7, 7, 4, 8, 8, 7, 8, 7, 4, 6, 4, 8, 6, 7, 7, 3, 7, 5, 8, 6, 3, 7, 7, 8, 4, 8, 6, 7, 7, 6, 6, 3, 6, 6, 8, 6, 6, 2, 4, 7, 6, 8, 8, 6, 3, 4, 8, 7, 6, 5, 7, 7, 8, 7, 3, 6, 4, 4, 4, 7, 4, 8, 7, 7, 6, 6, 6, 6, 6, 3, 4, 7, 6, 6, 6, 6, 6, 4, 6, 7, 7, 3, 6, 7, 6, 6, 6, 4, 7, 6, 6, 6, 7, 7, 4, 6, 3, 6, 6, 6, 6, 7, 6, 6, 4, 4, 6, 6, 4, 4, 4, 6, 4, 6, 6, 6, 6, 6, 6, 4, 6, 4, 4, 6, 6, 6, 8, 6, 6), Cluster = c(4L, 4L, 2L, 2L, 2L, 2L, 4L, 3L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 4L, 2L, 2L, 4L, 4L, 2L, 3L, 3L, 2L, 3L, 2L, 1L, 4L, 2L, 4L, 4L, 1L, 2L, 2L, 4L, 2L, 1L, 2L, 4L, 2L, 1L, 2L, 4L, 3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 4L, 2L, 2L, 4L, 2L, 2L, 2L, 4L, 4L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 4L, 1L, 2L, 4L, 4L, 2L, 3L, 2L, 2L, 2L, 4L, 4L, 1L, 2L, 4L, 4L, 4L, 2L, 4L, 2L, 2L, 2L, 2L, 1L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 4L, 2L, 3L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 3L, 4L, 2L, 2L, 4L, 4L, 1L, 2L, 2L, 4L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 3L, 4L, 4L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 4L, 2L, 2L, 4L, 1L, 2L, 2L, 4L, 2L, 4L, 3L, 4L, 2L, 3L, 1L, 4L, 4L, 4L, 2L, 2L, 2L, 4L, 2L, 1L, 3L, 2L, 1L, 2L, 2L, 3L, 2L, 3L, 1L, 4L, 3L, 4L, 3L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 2L, 3L, 4L, 3L, 4L, 4L, 2L, 4L, 4L, 2L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 2L, 4L, 3L, 1L, 4L, 4L, 2L, 2L, 4L, 2L, 2L, 4L, 3L, 2L, 2L, 1L)), row.names = c(NA, -274L), 
class = "data.frame")

# create renaming key
ed_factor_naming <-
  setNames(
    object = as.character(1:8),
    nm = c(
      "No schooling completed",
      "Some high school, no diploma",
      "High school graduate, diploma or the equivalent",
      "College graduate",
      "Trade/technical/vocational training",
      "Bachelors degree",
      "Masters degree",
      "Doctorate degree"
    )
  )

# recode data using key
data1 <- data %>% 
  mutate(Education = factor(Education, levels = 1:8)) %>% 
  mutate(Education = fct_recode(Education, !!!ed_factor_naming))

# set color mapping from levels
color_mapping <- setNames(hue_pal()(length(levels(data1$Education))), levels(data1$Education))

# plot with drop = FALSE to retain empty levels
data1 %>% 
  ggplot(aes(x = Cluster)) +
  geom_bar(aes(fill = Education), stat = "count", position = "fill") +
  scale_fill_manual(values = color_mapping, drop = FALSE) +
  scale_y_continuous(labels = percent)

Created on 2021-03-16 by the reprex package (v1.0.0)

Update 1

You can still do this as I described, but you can set whatever labels you like in scale_fill_manual() to recode the levels in your data to what you want them to display as. Alternatively, you can change them in your actual data with functions like mutate(var = case_when(***)) or fct_recode(). See the updated example under Original Answer below.
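As a minimal sketch of the second alternative (changing the values in the data itself), base R's factor() can attach the labels while declaring all levels in one step. The five answer labels and the codes vector here are made-up assumptions for illustration; they are not taken from the survey:

```r
# Hypothetical answer options for a question coded 1-5 (illustration only)
answer_labels <- c("Strongly disagree", "Disagree", "Neither",
                   "Agree", "Strongly agree")

# Toy responses (made up): nobody answered 1
codes <- c(2, 3, 3, 5, 4, 2)

# levels = declares every possible code; labels = renames them in the same order
recoded <- factor(codes, levels = 1:5, labels = answer_labels)

levels(recoded)   # all five labels, in questionnaire order
table(recoded)    # "Strongly disagree" is kept with a count of 0
```

Since the code-to-label match is made explicitly by position in levels and labels, a missing answer option cannot shift the remaining labels.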

Original Answer

Two keys to getting what you wanted here:

Use a named vector for colors to unambiguously assign them so that they will always map the same even if some are empty.
Add drop = FALSE to scale_fill_manual() to retain empty factor levels.

# load packages
library(tidyverse)
library(scales)

# make data reproducible
set.seed(1)

# simulate data
grp = 1:4
freq = LETTERS[1:5]

df <- expand_grid(grp, freq) %>%
  mutate(across(everything(), as.factor)) %>% 
  bind_cols(count = sample(x = c(1:10, rep(NA, 4)),
                        size = length(grp)*length(freq),
                        replace = TRUE)) %>%
  mutate(count = ifelse(freq == "E", NA, count)
  )

# set unambiguous color mapping for each category with named vector
color_mapping <- setNames(hue_pal()(length(freq)), freq)

# plot and use drop = FALSE in scale_fill_manual() to preserve empty factor levels
df %>% 
  ggplot(aes(x = grp, y = count)) +
  geom_col(aes(fill = freq), position = "fill") +
  scale_fill_manual(values = color_mapping, drop = FALSE, labels = c("These", "Are", "Arbitrary", "Legend", "Labels"))
#> Warning: Removed 9 rows containing missing values (position_stack).
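Since the question involves many plots that share the same answer options, the two keys could be wrapped in small helpers so that every plot gets an identical legend. This is only a sketch: make_color_mapping() and fixed_fill_scale() are hypothetical names introduced here (they are not part of ggplot2 or scales), and the toy data is made up:

```r
library(ggplot2)
library(scales)

# Hypothetical helper: named colour vector, one colour per answer option
make_color_mapping <- function(answer_levels) {
  setNames(hue_pal()(length(answer_levels)), answer_levels)
}

# Hypothetical helper: the same fill scale for every plot sharing these options
fixed_fill_scale <- function(answer_levels, name = waiver()) {
  scale_fill_manual(
    name = name,
    values = make_color_mapping(answer_levels),  # key 1: named colours
    drop = FALSE                                 # key 2: keep empty levels
  )
}

# Toy data (made up): nobody chose "Disagree"
agree_levels <- c("Disagree", "Neutral", "Agree")
toy <- data.frame(
  Cluster = c(1, 1, 2, 2, 2),
  Answer = factor(c("Neutral", "Agree", "Agree", "Neutral", "Agree"),
                  levels = agree_levels)
)

p <- ggplot(toy, aes(x = Cluster, fill = Answer)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = percent) +
  fixed_fill_scale(agree_levels, name = "Answer")
```

The fill variable still has to be a factor with all levels declared; the helper only fixes the colours and the legend, not the data.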