217

I've been getting up to speed with R in the last month.

Here is my question:

What is a good way to assign colors to the levels of a categorical variable in ggplot2 so that the mapping stays stable? I need consistent colors across a set of graphs that show different subsets of the data and therefore different numbers of levels.

For example,

plot1 <- ggplot(data, aes(xData, yData, color = categoricalData)) + geom_line()

where categoricalData has 5 levels.

And then

plot2 <- ggplot(data.subset, aes(xData.subset, yData.subset,
                                 color = categoricalData.subset)) + geom_line()

where categoricalData.subset has 3 levels.

However, a particular level that is in both sets will end up with a different color, which makes it harder to read the graphs together.

Do I need to create a vector of colors in the data frame? Or is there another way to assign specific colors to categories?

ROMANIA_engineer
wintour

5 Answers

232

For simple situations like the exact example in the OP, I agree that Thierry's answer is the best. However, I think it's useful to point out another approach that becomes easier when you're trying to maintain consistent color schemes across multiple data frames that are not all obtained by subsetting a single large data frame. Managing the factor levels in multiple data frames can become tedious if they are being pulled from separate files and not all factor levels appear in each file.

One way to address this is to create a custom manual colour scale as follows:

# Some test data
library(ggplot2)
dat <- data.frame(x = runif(10), y = runif(10),
        grp = rep(LETTERS[1:5], each = 2), stringsAsFactors = TRUE)

# Create a custom color scale: one color per factor level, named by level
library(RColorBrewer)
myColors <- brewer.pal(5, "Set1")
names(myColors) <- levels(dat$grp)
colScale <- scale_colour_manual(name = "grp", values = myColors)

and then add the color scale onto the plot as needed:

#One plot with all the data
p <- ggplot(dat,aes(x,y,colour = grp)) + geom_point()
p1 <- p + colScale

#A second plot with only four of the levels (rows 4 to 10 contain B, C, D and E)
p2 <- p %+% droplevels(dat[4:10, ]) + colScale

The first plot looks like this:

enter image description here

and the second plot looks like this:

enter image description here

This way you don't need to remember or check each data frame to see that they have the appropriate levels.
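
When the data frames come from separate sources, the named vector can be built from the union of the groups rather than from the levels of a single data frame. A minimal sketch, assuming two hypothetical data frames `datA` and `datB`, and using base R's `hcl.colors()` (R 3.6 or later) in place of RColorBrewer:

```r
library(ggplot2)

# Two hypothetical data frames with partly overlapping groups
datA <- data.frame(x = 1:3, y = c(2, 4, 3), grp = c("A", "B", "C"))
datB <- data.frame(x = 1:3, y = c(1, 5, 2), grp = c("B", "D", "E"))

# Union of every group seen in any of the data frames
allGroups <- sort(unique(c(as.character(datA$grp), as.character(datB$grp))))

# One color per group, named by group so ggplot2 matches colors by name
myColors <- setNames(hcl.colors(length(allGroups), palette = "Dark 3"), allGroups)
colScale <- scale_colour_manual(name = "grp", values = myColors)

# Group "B" appears in both plots and gets the same color in each
pA <- ggplot(datA, aes(x, y, colour = grp)) + geom_point() + colScale
pB <- ggplot(datB, aes(x, y, colour = grp)) + geom_point() + colScale
```

The same `colScale` object can be added to any later plot whose groups are covered by `allGroups`.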

Axeman
joran
  • 1
    This will work, but is probably over-complicated. I don't think you need to create a manual scale for this. All you need is a `factor` that is common between all plots. – Andrie Aug 03 '11 at 10:31
  • 20
    @Andrie - For a single subset, yeah. But if you're juggling lots of data sets that weren't all created by subsetting one original data frame, I find this strategy much simpler. – joran Aug 03 '11 at 13:48
  • 2
    @joran Thanks Joran. This worked for me! It creates a legend with the right number of factors. I like the approach and to get color mappings across different data sets is well-worth the three lines. – wintour Aug 05 '11 at 19:46
  • 1
    @Rafael - No problem! I would recommend that you check back here in a few days, though, as hadley may yet swoop in with a better answer. He is the expert, after all. – joran Aug 05 '11 at 19:57
  • 3
    I needed: library("RColorBrewer") – PatrickT Apr 25 '14 at 21:26
  • 6
    worked perfectly! I added in `fillScale <- scale_fill_manual(name = "grp",values = myColors)` to use this with bar plots. – pentandrous Jun 27 '16 at 17:14
  • 1
    This approach has an added benefit: only values which appear on each subset-graph appear in the legend. I'd default to this approach if you're waffling. – Nova Oct 30 '17 at 13:24
  • 1
    Nice solution. In case one needs a large number of colours (e.g. >= 20 colours), one can use the built-in function colors() in R, e.g. myColours <- colors()[1:20] – Good Will Mar 04 '20 at 17:52
  • Just want to add that setting colour=grp in the aes call is essential to getting the colours to show up! – TAH May 18 '21 at 14:37
  • 1
    If you need to use more than 9 colours, there's a neat trick documented using a colour ramp over at https://www.r-bloggers.com/2013/09/how-to-expand-color-palette-with-ggplot-and-rcolorbrewer/, which I needed in addition to the above. – dsz Aug 14 '21 at 02:01
  • Recently (since summer 2021?) one must add scale_fill_manual(... limits = force) or the unused variables will show up in the legend. see https://github.com/tidyverse/ggplot2/issues/4534 – Ira S Feb 08 '22 at 21:31
  • @dsz The code of the link does run but lots of levels have the same colour – Julien Jan 10 '23 at 13:59
45

I am in the same situation pointed out by malcook in his comment: unfortunately the answer by Thierry does not work with ggplot2 version 0.9.3.1.

png("figure_%d.png")
set.seed(2014)
library(ggplot2)
dataset <- data.frame(category = rep(LETTERS[1:5], 100),
    x = rnorm(500, mean = rep(1:5, 100)),
    y = rnorm(500, mean = rep(1:5, 100)))
dataset$fCategory <- factor(dataset$category)
subdata <- subset(dataset, category %in% c("A", "D", "E"))

ggplot(dataset, aes(x = x, y = y, colour = fCategory)) + geom_point()
ggplot(subdata, aes(x = x, y = y, colour = fCategory)) + geom_point()

Here is the first figure:

ggplot A-E, mixed colors

and the second figure:

ggplot ADE, mixed colors

As we can see, the colors do not stay fixed: for example, E switches from magenta to blue.
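
This shift is consistent with how the default discrete colour scale appears to work: it spaces hues evenly around the colour wheel for however many levels it is asked to show, so the colour of a level depends on how many levels are present. A small sketch, assuming the scales package (installed alongside ggplot2) and that `hue_pal()` with its default arguments matches the default palette:

```r
library(scales)

# Default hues for five levels (A to E) and for three levels (A, D, E)
fullColours <- setNames(hue_pal()(5), LETTERS[1:5])
subColours  <- setNames(hue_pal()(3), c("A", "D", "E"))

fullColours[["E"]]  # colour of E when five levels are present
subColours[["E"]]   # colour of E when only A, D and E are present
```

The first level keeps its colour in both cases, while every later level moves as the number of levels changes.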

As suggested by malcook and by hadley in their comments, the code that uses limits works properly:

ggplot(subdata, aes(x = x, y = y, colour = fCategory)) +       
    geom_point() + 
    scale_colour_discrete(drop=TRUE,
        limits = levels(dataset$fCategory))

gives the following figure, which is correct:

correct ggplot
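
With `limits` fixed to the full set of levels, the legend can also list levels that are absent from the subset. One way to trim it, suggested in the comments, is to pass the levels that actually occur as `breaks`. A sketch that reuses the simulated data from this answer; the variable `present` is introduced here purely for illustration:

```r
library(ggplot2)
set.seed(2014)
dataset <- data.frame(category = rep(LETTERS[1:5], 100),
    x = rnorm(500, mean = rep(1:5, 100)),
    y = rnorm(500, mean = rep(1:5, 100)))
dataset$fCategory <- factor(dataset$category)
subdata <- subset(dataset, category %in% c("A", "D", "E"))

# Levels that actually occur in the subset
present <- levels(droplevels(subdata$fCategory))

pSub <- ggplot(subdata, aes(x = x, y = y, colour = fCategory)) +
    geom_point() +
    scale_colour_discrete(
        limits = levels(dataset$fCategory),  # fixes the colour of every level
        breaks = present)                    # restricts the legend entries
```

Here `limits` decides which colour each level receives, while `breaks` only controls which entries the legend shows.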

This is the output from sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] ggplot2_0.9.3.1

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4       grid_3.0.2        
 [5] gtable_0.1.2       labeling_0.2       MASS_7.3-29        munsell_0.4.2     
 [9] plyr_1.8           proto_0.3-10       RColorBrewer_1.0-5 reshape2_1.2.2    
[13] scales_0.2.3       stringr_0.6.2 
Community
Alessandro Jacopson
  • 3
    You should post this as a new question, referencing this question and showing why the solutions here didn't work. – Brian Diggs Jan 15 '14 at 20:19
  • 1
    So I know this is old but I wonder if there is a way to do this without having the extra colors in the legend. – goryh Jan 17 '20 at 00:48
  • To remove unused levels from a legend, limits = force should now be added. https://github.com/tidyverse/ggplot2/issues/4556 – Marinka Dec 20 '21 at 17:23
37

This is an old post, but I was looking for an answer to this same question.

Why not try something like:

scale_color_manual(values = c("foo" = "#999999", "bar" = "#E69F00"))

If you have categorical values, I don't see a reason why this should not work.
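
A fuller sketch of this approach, assuming hypothetical data in a data frame `d`; the third category `"baz"` and its colour are additions for illustration only:

```r
library(ggplot2)

# Fixed colour per category; "baz" and its colour are hypothetical additions
categoryColours <- c("foo" = "#999999", "bar" = "#E69F00", "baz" = "#56B4E9")

# Hypothetical data
d <- data.frame(x = 1:6, y = c(3, 1, 4, 1, 5, 9),
                grp = rep(c("foo", "bar", "baz"), times = 2))

pAll <- ggplot(d, aes(x, y, colour = grp)) + geom_point() +
    scale_colour_manual(values = categoryColours)

# The subset lacks "bar", yet "foo" and "baz" keep their colours
pSub <- ggplot(subset(d, grp != "bar"), aes(x, y, colour = grp)) + geom_point() +
    scale_colour_manual(values = categoryColours)
```

Because the values are matched by name, the order of the vector and the set of categories present in a given plot do not affect which colour a category receives.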

Pavlos Panteliadis
  • 6
    This is actually what Joran's answer does, but using `myColors <- brewer.pal(5,"Set1"); names(myColors) <- levels(dat$grp)` to avoid having to manually code the levels. – Axeman Apr 09 '18 at 07:53
  • 3
    However, Joran's answer does not hard code the values of the colors. There are cases where you need a specific color value for a given factor. – René Nyffenegger May 27 '19 at 21:44
  • 1
    While I get the downside of "hard coding" in certain cases, I think that too often the layers of abstraction developers/coders add makes their work less accessible, not more. The intent is 100% clear in this case. Plus it is easy enough to think of how to make a utility function that expands on this example that returns a named vector of specific colors. – Matt Barstead Jan 25 '20 at 23:13
20

Based on the very helpful answer by joran I was able to come up with this solution for a stable color scale for a boolean factor (TRUE, FALSE).

# Keep the names: wrapping the vector in as.character() would strip them
boolColors <- c("TRUE" = "#5aae61", "FALSE" = "#7b3294")
boolScale <- scale_colour_manual(name="myboolean", values=boolColors)

ggplot(myDataFrame, aes(date, duration)) + 
  geom_point(aes(colour = myboolean)) +
  boolScale

Since ColorBrewer isn't very helpful with binary color scales, the two needed colors are defined manually.

Here myboolean is the name of the column in myDataFrame holding the TRUE/FALSE factor. date and duration are the column names to be mapped to the x and y axis of the plot in this example.
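
The names on the colour vector are what tie each colour to a value, and `as.character()` strips names, so the sketch below builds the vector with `c()` alone. It assumes a hypothetical `myDataFrame` with a logical `myboolean` column, and checks the case where only one of the two values occurs:

```r
library(ggplot2)

# Named vector: the names must match the values "TRUE" and "FALSE"
boolColors <- c("TRUE" = "#5aae61", "FALSE" = "#7b3294")

# Hypothetical data with a logical column
myDataFrame <- data.frame(date = as.Date("2020-01-01") + 0:3,
                          duration = c(10, 12, 9, 14),
                          myboolean = c(TRUE, FALSE, TRUE, TRUE))

# A subset in which only TRUE occurs
onlyTrue <- subset(myDataFrame, myboolean)

# TRUE is still drawn in "#5aae61" because colours are matched by name
p <- ggplot(onlyTrue, aes(date, duration)) +
    geom_point(aes(colour = myboolean)) +
    scale_colour_manual(name = "myboolean", values = boolColors)
```

With an unnamed vector the colours would instead be assigned by position, so a plot containing only one of the two values could pick up the other value's colour.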

Marian
  • Another approach is to apply "as.character()" to the column. This will make it a string column that works well with scale_*_manual – Sahir Moosvi May 01 '20 at 18:55
18

The easiest solution is to convert your categorical variable to a factor prior to the subsetting. The bottom line is that you need a factor variable with exactly the same levels in all your subsets.

library(ggplot2)
dataset <- data.frame(category = rep(LETTERS[1:5], 100), 
    x = rnorm(500, mean = rep(1:5, 100)), y = rnorm(500, mean = rep(1:5, 100)))
dataset$fCategory <- factor(dataset$category)
subdata <- subset(dataset, category %in% c("A", "D", "E"))

With a character variable

ggplot(dataset, aes(x = x, y = y, colour = category)) + geom_point()
ggplot(subdata, aes(x = x, y = y, colour = category)) + geom_point()

With a factor variable

ggplot(dataset, aes(x = x, y = y, colour = fCategory)) + geom_point()
ggplot(subdata, aes(x = x, y = y, colour = fCategory)) + geom_point()
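
Several comments report that, in later ggplot2 versions, the shared factor alone no longer keeps the colours fixed, apparently because the discrete scale drops unused levels by default. A sketch, assuming `scale_colour_discrete(drop = FALSE)` behaves this way in the installed version, that keeps every level of the factor in the scale:

```r
library(ggplot2)
set.seed(1)
dataset <- data.frame(category = rep(LETTERS[1:5], 100),
    x = rnorm(500, mean = rep(1:5, 100)), y = rnorm(500, mean = rep(1:5, 100)))
dataset$fCategory <- factor(dataset$category)
subdata <- subset(dataset, category %in% c("A", "D", "E"))

# subset() keeps all five levels on fCategory;
# drop = FALSE asks the scale to keep them as well
pSub <- ggplot(subdata, aes(x = x, y = y, colour = fCategory)) +
    geom_point() +
    scale_colour_discrete(drop = FALSE)
```

The trade-off is that the legend may then list all five levels, including those with no points in the subset.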
eli-k
Thierry
  • 12
    The easiest way is to use limits – hadley Aug 03 '11 at 22:35
  • 2
    Could provide an example in this context Hadley? I'm not sure how to use limits with a factor. – Thierry Aug 05 '11 at 09:10
  • @Thierry Thanks. I was happy to get responses on my first post. And thanks, Thierry, for adding the reproducible code that I should have included in my post... My categorical variables were the right type: factors. The other issue is that I want the legend not to show unused factor levels. R ignores unused character values when building the legend. However, unused factor levels persist. If I drop them using: subdata$category <- factor(subdata$category)[drop=TRUE] then the legend has the right number of levels BUT loses the mapping. – wintour Aug 05 '11 at 19:37
  • @hadley Happy/honored to get a comment from you on my post. I see how to use limits if one of the variables is on the x or y axis but haven't yet figured how to use them for other variables affecting the plot. – wintour Aug 05 '11 at 19:43
  • 15
    @Thierry - in my hands, using ggplot2_0.9.3.1, this method does not (any longer?) work; the colors assigned to the fCategory are different between the two plots. However, happily, @wintour, I figured out that @hadley is suggesting `+ scale_colour_discrete(drop=TRUE,limits = levels(dataset$fCategory))` to preserve the color|factor association, which works, except that, in my hands, the [drop=TRUE](http://docs.ggplot2.org/current/discrete_scale.html) is **NOT** being respected (I expect it to remove the level from the legend). Drat ... or is it me? – malcook Oct 30 '13 at 19:15
  • 2
    @malcook, instead of drop = TRUE, you need to specify which levels you want to keep via "breaks": https://github.com/hadley/ggplot2/issues/1433 – Eric Aug 28 '16 at 20:50
  • It needs to be stressed that this does not work (any more). – bers May 11 '21 at 12:13