I want to plot only a subset of my data, dependent on another column. I want to do this within ggplot, not by subsetting my data. As a simple example:

ggplot(mtcars, aes(x=hp, y=mpg)) +
geom_point()

How would I get geom_point to only plot points with cyl == 4?

In my real data, it would depend on the value of another column being TRUE.

Mike

2 Answers

In this case I think it's best to do the filtering inside the individual geom layers since they're all different subsets of the same data source. Here are a couple of options on how to do this. I think option 1 is much cleaner code.

Option 1

If you look at the documentation for any geom_*() function you'll see that there are actually 3 options for what to provide as data.

If NULL, the default, the data is inherited from the plot data as specified in the call to ggplot().

A data.frame, or other object, will override the plot data. All objects will be fortified to produce a data frame. See fortify() for which variables will be created.

A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data. A function can be created from a formula (e.g. ~ head(.x, 10)).

This last option can be used here to perform additional manipulation/filtering on your data prior to using it in a particular geom_*() layer.

library(tidyverse)

# give function as data
mtcars %>% 
  mutate(newcol = cyl * wt) %>%
  rownames_to_column("car") %>%
    ggplot() +
      geom_point(data = ~filter(.x, cyl > 4 & qsec < 17),
                 aes(x = hp, y = mpg), color = "red") +
      geom_text(data = ~filter(.x, newcol < 10 | disp < 90),
                aes(x = hp, y = mpg, label = car))

Created on 2022-02-20 by the reprex package (v2.0.1)
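Applied to the example in the question, a minimal sketch could look like the following. It passes a plain function (rather than a formula) as data, so it needs only ggplot2. The logical column is_four is a hypothetical stand-in for the TRUE/FALSE column mentioned in the question, and layer_data() is used only to check which rows the layer actually receives.

```r
library(ggplot2)

# The plot keeps the full mtcars data; only this layer sees the cyl == 4 rows.
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(data = function(d) d[d$cyl == 4, ])

# Same idea with a logical column (is_four is a made-up name for illustration).
cars <- transform(mtcars, is_four = cyl == 4)
p_flag <- ggplot(cars, aes(x = hp, y = mpg)) +
  geom_point(data = function(d) d[d$is_four, ])

# Number of rows the first layer draws
nrow(layer_data(p, 1))
#> [1] 11
```

Because the filtering lives in the layer, other layers added to the same plot still inherit the unfiltered data.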

Option 2

Second, you could capture the output of the {magrittr} pipe (%>%) as . and filter inside the geom_*()'s data argument. To prevent the output of %>% from going in as the first argument, you need to wrap the ggplot() call in curly braces {} and then also wrap the pipe output in curly braces, like this: {.}. Without the outer braces, the layers added with + sit outside the pipe (%>% binds more tightly than +), so a bare . inside them is not found. In some cases it will work without this treatment, since data can be the first argument, but not always, depending on how you construct the call. Therefore it's safest to use the {} approach.

This somewhat unintuitive behaviour of the {magrittr} pipe is lightly documented here. There's also a nice explanation of it in this answer.

You can combine multiple conditions in the filter operation by connecting them with logical OR (|) or AND (&) operators.

library(tidyverse)

# works with or without {}
mtcars %>% 
  mutate(newcol = cyl * wt) %>%
  rownames_to_column("car") %>% 
  {
    ggplot() +
      geom_point(data = {.} %>% filter(cyl > 4 & qsec < 17),
                 aes(x = hp, y = mpg), color = "red") +
      geom_text(data = {.} %>% filter(newcol < 10 | disp < 90),
                aes(x = hp, y = mpg, label = car))
  }  

# error without {}
mtcars %>% 
  mutate(newcol = cyl * wt) %>%
  rownames_to_column("car") %>%
    ggplot() +
      geom_point(data = filter(., cyl > 4 & qsec < 17),
                 aes(x = hp, y = mpg), color = "red") +
      geom_text(data = filter(., newcol < 10 | disp < 90),
                aes(x = hp, y = mpg, label = car))
#> Error in filter(., cyl > 4 & qsec < 17): object '.' not found

Dan Adams
  • Why do you need the curly braces? Seems to work without them – Mike Feb 20 '22 at 11:25
  • Actually, you're right. In this case it still works when you remove them. However, depending on the order of arguments in each geom layer, I think it might break in some cases, so this is still the safest way to write it. This is discussed in more detail [here](https://stackoverflow.com/questions/42385010/using-the-pipe-and-dot-notation/42386886#42386886), [here](https://qiita.com/aakansh9/items/c2c4d3f653162778b757), and [here](https://thatdatatho.com/tutorial-about-magrittrs-pipe-operator-and-placeholders/). – Dan Adams Feb 20 '22 at 14:51

Looks like this can be done with geom_point(data = . %>% filter(newcol>4), color="red")
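A small sketch of why this works, as far as I understand it: when a {magrittr} pipeline starts with a bare ., it is not evaluated immediately but returns a function, and ggplot2 then calls that function on the plot data (the same mechanism as passing a function as data). The name keep_four is just for illustration, and the example filters on cyl == 4 from the question rather than on newcol.

```r
library(dplyr)
library(ggplot2)

# A pipeline that starts with a bare `.` is a function, not a result.
keep_four <- . %>% filter(cyl == 4)
is.function(keep_four)
#> [1] TRUE

# ggplot2 applies that function to the plot data for this layer only.
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(data = keep_four, color = "red")

nrow(layer_data(p, 1))
#> [1] 11
```

Note that this relies on the plot itself having data (here, mtcars passed to ggplot()), since the function is called on whatever the plot holds.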

Mike