
I thought this would be easy, but I couldn't find a solution. I am trying to make a ggplot2 scatter plot in R showing the correlation between col1 and col2, with the size of each dot set by col3 and the shape by col4. col3 and col4 have NA/missing values. When I run the code below, ggplot2 removes the rows that have no value in col3 and/or col4; however, I want to keep these rows and color-code them. Output below.

Example dataframe:

Warning: Removed 3 rows containing missing values (geom_point).

  1. I tried to create another geom_point with is.na(df$col3 | df$col4), but that didn't work.
  2. I tried adding na.rm=FALSE in
geom_point(aes(size=df$col3, col=df$col4), na.rm=FALSE)

  3. I tried
scale_size(range = c(0.25,4), na.value = 0) #to give a 0 value to the na.value (although would rather not)

But I ended up with "Ignoring unknown aesthetics: na.rm" for #2 and #3, and #1 gave an error. Also, none of this fixes the issue that the points with NA in col4 are being removed too.

ggplot(df, aes(x=df$col1, y=df$col2)) + 
    geom_point(aes(size=df$col3, col=df$col4), na.rm=FALSE) + 
    theme_classic() + 
    scale_size(range = c(0.25,4)) 
             
+-------------+-------------+-------------+----------+
|    col1     |    col2     |    col3     |   col4   |
+-------------+-------------+-------------+----------+
| 0.254393811 | 0.124242905 | NA          | NA       |
|  0.28223149 | 0.148601748 | 0.236953099 | CD8CTL   |
| 0.205945835 | 0.074541695 | NA          | NA       |
| 0.199758631 | 0.103369485 | NA          | CD8Mem   |
|   0.2798128 | 0.109511863 | 0.396113132 | CD8STAT1 |
| 0.254616042 | 0.059495241 | 0.479590212 | CD8CTL   |
| 0.197929395 |  0.10993698 | 0.272611442 | CD8CTL   |
| 0.294888359 |  0.12319682 | 0.16069263  | CD8CTL   |
| 0.191407446 | 0.086443936 | 0.36596486  | CD8CTL   |
| 0.267533392 |  0.11240525 | 0.344659516 | CD8CTL   |
+-------------+-------------+-------------+----------+
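For anyone who wants to reproduce this, the table above can be rebuilt as a data frame along these lines. This is only a sketch: the values are copied from the table, while storing col4 as character and the empty cells as real NA values is an assumption about the original data.

```r
# Values copied from the example table; column types are an assumption
df <- data.frame(
  col1 = c(0.254393811, 0.28223149, 0.205945835, 0.199758631, 0.2798128,
           0.254616042, 0.197929395, 0.294888359, 0.191407446, 0.267533392),
  col2 = c(0.124242905, 0.148601748, 0.074541695, 0.103369485, 0.109511863,
           0.059495241, 0.10993698, 0.12319682, 0.086443936, 0.11240525),
  col3 = c(NA, 0.236953099, NA, NA, 0.396113132,
           0.479590212, 0.272611442, 0.16069263, 0.36596486, 0.344659516),
  col4 = c(NA, "CD8CTL", NA, "CD8Mem", "CD8STAT1",
           "CD8CTL", "CD8CTL", "CD8CTL", "CD8CTL", "CD8CTL"),
  stringsAsFactors = FALSE
)

# Count the missing values per column
colSums(is.na(df))
```

The three missing values in col3 line up with the "Removed 3 rows" warning, since a point cannot be drawn with a missing size.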

Out of the 10 rows, only the complete ones are plotted.

BioProgram

2 Answers


There are a few things to note, and I think I have understood what the OP is looking to do here: you want all points to be plotted. Here is how we want the plot to look:

  • col1 is used to plot x axis
  • col2 is used to plot y axis
  • col3 is used to control the size of the point
  • col4 is used to control the color of the point

We have NA values in col3 and col4. So what to do with those? Well, for color, I'm going to have those included in the legend, color-coded and labeled as "NA". What about for size? Well, size=NA doesn't make any sense, so I think the best thing to do for rows where col3 is NA is to change the shape. Here's what I've done:

ggplot(df, aes(x=col1, y=col2, color=col4)) +
  geom_point(aes(size=col3, shape='Not NA')) +
  geom_point(data=subset(df, is.na(col3)), aes(shape='NA'), size=3) +
  scale_shape_manual(values=c('NA'=3, 'Not NA'=19)) +
  theme_classic()


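First of all, it's bad form to reference columns via data.frame$column.name inside aes() - you should use just the column name itself. With df$col, a layer that is given its own data argument (like the subsets used above) would still pull its values from the full data frame.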

Color is easy - we just put color=col4 in the top aes() specification, since it's applied to every geom.

For the shape, it's probably easiest here to use two separate calls to geom_point(). The first maps size to col3 with no special handling, so rows where col3 is NA are dropped from it - you won't get points plotted with size=NA. To "add back in" the NA points, we have to specifically pull those rows out and give them a fixed size. Finally, in order to get the shape aesthetic into a legend, we need to put it inside aes(). The general rule here is that if you set an aesthetic equal to a column name inside aes(), it will use the values in that column for labelling. If you instead type a character string inside aes(), like we did here, all items in that geom call are labeled with that string - but a legend is still created. So we are basically creating our own custom legend for shape here.

Then it's just a matter of using scale_shape_manual() and a named vector for the values argument to set the actual shape we want to use.
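To make the difference concrete, here is a minimal sketch contrasting a constant placed inside aes() with the same aesthetic set outside it. The `toy` data frame is made up for illustration and is not the OP's data; peeking at the layer's `mapping` and `aes_params` fields is only there to show where each value ends up.

```r
library(ggplot2)

# Toy data, purely illustrative
toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Constant inside aes(): treated as a mapping, so a shape legend with the
# single entry "Not NA" is created; the named vector picks the actual shape
p_mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(shape = "Not NA")) +
  scale_shape_manual(values = c("Not NA" = 19))

# Constant outside aes(): a fixed setting, so no shape legend is created
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(shape = 19)

# Where each value is stored on the layer
names(p_mapped$layers[[1]]$mapping)  # the mapped aesthetics of the layer
p_set$layers[[1]]$aes_params$shape   # the fixed shape of the layer
```

Both plots draw the same filled circles; only the first one carries a shape legend.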

EDIT

Thinking about this a bit more, it doesn't make sense for NA to appear in the legend for color and shape, so let's remove it from color. That's done by completely separating the dataset that includes NAs in col3 from the one that doesn't:

ggplot(df, aes(x=col1, y=col2, color=col4)) +
  geom_point(data=subset(df, !is.na(col3)), aes(size=col3, shape='Not NA')) +
  geom_point(data=subset(df, is.na(col3)), aes(shape='NA'), size=3) +
  scale_shape_manual(values=c('NA'=3, 'Not NA'=19)) +
  theme_classic()
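As a quick sanity check on the two subsets used above, the sketch below confirms that every row ends up in exactly one of the two layers. The col3 and col4 values are copied from the question's table; the variable names `sized`, `unsized` and `either_missing` are made up for illustration.

```r
# col3 and col4 copied from the question's example table
df <- data.frame(
  col3 = c(NA, 0.236953099, NA, NA, 0.396113132,
           0.479590212, 0.272611442, 0.16069263, 0.36596486, 0.344659516),
  col4 = c(NA, "CD8CTL", NA, "CD8Mem", "CD8STAT1", rep("CD8CTL", 5)),
  stringsAsFactors = FALSE
)

sized   <- subset(df, !is.na(col3))  # rows drawn with size mapped to col3
unsized <- subset(df, is.na(col3))   # rows drawn with the fixed-size "NA" shape

# Every row belongs to exactly one of the two layers
nrow(sized) + nrow(unsized) == nrow(df)

# To flag rows missing either column, test each column separately and then
# combine the results; is.na(df$col3 | df$col4) fails because `|` cannot be
# applied to a character column
either_missing <- is.na(df$col3) | is.na(df$col4)
sum(either_missing)
```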


chemdork123
  • Thank you for this excellent response. I played with this a bit to fit my much larger dataset, and it works. However, a couple of questions for clarification: 1) The shape = "Not NA" and "NA" are confusing. Is that a regular expression in R? I know NA is written without quotation marks, but 'Not NA' isn't defined in my data. 2) Is scale_shape_manual just editing the shape sizes for the legend? 3) I added scale_size(range = c(0.25,4)) at the end to decrease the size of the dots (as I have 10,000 rows). Is this redundant with any other function? It didn't seem like it. Thanks so much. – BioProgram Apr 17 '21 at 21:19
  • `shape = "Not NA"` is not evaluated as a logical expression. It's just supplying a name for that entire dataset. The best way to think of it is this: normally `shape = col1` would be referencing a name of a column in the data, and then labels are drawn from that column data as a factor (or continuous). When you change that to `shape = "col1"`, it does not reference a column but rather... kind of "makes another one" in your dataset where all the values are equal to `"col1"`. If that was the case and you referenced that one column, you'd get everything labeled as "col1". Kind of like that.. – chemdork123 Apr 18 '21 at 13:02
  • Both of the `scale_` functions are specifying *how* you want `ggplot` to control that scale. The `aes()` part specifies how to map it out and then if you don't specify `scale_`, it will guess at the best way to differentiate. If you want to change the default shapes, colors, etc, you need to specify with some `scale_` function. So not redundant. – chemdork123 Apr 18 '21 at 13:04
-3

Please take a look at the following links:

http://naniar.njtierney.com/reference/geom_miss_point.html

Plotting missing values in ggplot2 with a separate line type

By the way, your explanation of what you are trying to achieve is clear. The problem comes down to which shape and color to use when there is no value in col3 and col4. Maybe try solving it with a rule like this:

When col3 and col4 are NA, use one specific color and shape for the col1 vs. col2 point.

Another option to test would be geom_miss_point.

Daniel