21

I want to do the opposite of this question, and sort of the opposite of this question, though that's about legends, not the plot itself.

The other SO questions seem to be asking about how to keep unused factor levels. I'd actually like mine removed. I have several name variables and several columns (wide format) of variable attributes that I'm using to create numerous bar plots. Here's a reproducible example:

library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")

I get this:

[Image: bar plot of var1 by name, with an empty slot where B would be]

I'd like only the names that have a corresponding value in the plotted var column to show up in my bar plot (that is, there would be no empty space for B).

Reusing the base plot code will be quite easy if I can simply change my output file name and the y=var bit. I'd like not to have to subset my data frame just to use droplevels on the result for each plot, if possible!


Update based on the na.omit() suggestion

Consider a revised data set:

library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5), var3=c(NA,6,7))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")

I need to use na.omit() for plotting var1 because there's an NA present. But since na.omit() keeps only rows with values present in all columns, the plot removes A as well, since it has an NA in var3. This is more analogous to my data: I have 15 total responses with NAs peppered about. I only want to remove factor levels that don't have values for the currently plotted y vector, not levels that have NAs in any vector in the whole data frame.

Hendy
  • Why do you need to have a data frame to begin with, when you're actually just plotting one row of it? If you didn't have the data frame (and, rather, had just a list/vector), you could just drop the NA fields. – Tilo Wiklund Jul 09 '12 at 21:15
  • @TiloWiklund: I'm an R novice, so feel free to suggest an alternative. I'm plotting a column of names against numerous different columns of data for a series of plots. Some columns are incomplete, some aren't. The incomplete ones are analogous to the above and leave gaps which I don't want since I only need to compare the variables that actually have data associated with them for that particular measured response. Does that make sense? – Hendy Jul 09 '12 at 21:18
  • You can also simply drop rows by setting a condition on only that column: `ggplot(df[!is.na(df$var1),], aes(x=name,y=var1)) + geom_bar()`. – joran Jul 09 '12 at 21:46
  • @joran: this seems quite similar to Tilo's solution below, though a bit simpler than passing two vector names. Regardless of the tweak on omitting na's, I guess the real lesson learned is that there's no way to do this automatically from ggplot. – Hendy Jul 09 '12 at 21:58
  • @TiloWiklund `ggplot()` expects a data frame as its first argument. And anyway, isn't he plotting two rows of it, one a factor (`name`) the other a numeric (`var1`)? Hendy needs to pass both variables, otherwise how does `ggplot()` know to plot the values as two bars and not as a numeric vector of data? – Gavin Simpson Jul 10 '12 at 07:29
  • @GavinSimpson true, bad wording on my part. I should have said he used a fixed and finite number of columns. – Tilo Wiklund Jul 10 '12 at 10:52

3 Answers

21

One easy option is to use na.omit() on your data frame df to remove the rows that contain an NA:

ggplot(na.omit(df), aes(x=name,y=var1)) + geom_bar(stat="identity")

Given your update, the following

ggplot(df[!is.na(df$var1), ], aes(x=name,y=var1)) + geom_bar(stat="identity")

works OK and only considers the NAs in var1. Alternatively, given that you are only plotting name and var1, you can apply na.omit() to a data frame containing only those variables:

ggplot(na.omit(df[, c("name", "var1")]), aes(x=name,y=var1)) + geom_bar(stat="identity")
Gavin Simpson
  • This is great and very easy. My actual data set must be missing one or more values from numerous columns as applying `na.omit` leaves me with a data frame with no rows... Any other suggestions? – Hendy Jul 09 '12 at 21:12
  • It would have been helpful to know that initially. See my updated Answer. – Gavin Simpson Jul 09 '12 at 22:14
  • I would have liked to specify that as well. Not knowing what solution would come up, I didn't realize that would be an issue. Honestly, I expected there to be a ggplot option to drop things. Given that people want to *keep* unused levels in the linked questions and are able to specify `drop=FALSE`, I kind of wondered why `drop=T` wouldn't do exactly what I wanted! Thanks for the updated answer. – Hendy Jul 09 '12 at 22:18
  • @Hendy The level `B` is not unused in your example. It is very much used as it is present in the data. The `NA` is just as valid a data point as any other value as far as R is concerned. A truly unused level would be the `B` in `A <- factor(c("a","c"), levels = c("a","b","c"))`. In `A`, the level `b` is not present in the data. – Gavin Simpson Jul 10 '12 at 07:26
  • Sure, technically, I suppose. From my perspective, I have prototypes and numerous measured test results. There is no data at the intersection of prototype `B` and test method `var1`. My data frame is composed of a column of prototype names and columns of test data. Wide format. "Truly unused levels" are only possible in long, right? – Hendy Jul 10 '12 at 07:35
  • You are missing the point that there *is* data at the intersection of `B` and `var1`; we just don't have the value of that data available to us. Re wide vs long, again that is not strictly correct; for example, if we do `df[!is.na(df$var1), ]` the variable `name` in the resulting data frame *is* a factor with the same levels as the full data set and hence it now does have a truly un-used level, that of `B`. Un-used levels can crop up in any factor. – Gavin Simpson Jul 10 '12 at 07:51
  • Put another way, forget about `var1` as that has nothing to do with the un-used level. If there are no elements of a factor that correspond to one or more levels of the factor then those levels are considered un-used. This is independent of the data structure or format. – Gavin Simpson Jul 10 '12 at 07:52
  • I guess my point was that if my data was in long format with vectors `name`, `var` and `value`, I could create a "truly unused factor" without massaging data with `!is.na`. That's not possible in my current data arrangement, correct? (With the above, I have no choice but to have a "blank" at the intersection of `name` and `var1`. In long form I just wouldn't have a row in which `name=B`, `var=1`.) – Hendy Jul 12 '12 at 14:32
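
The distinction discussed in these comments can be checked directly. The snippet below is only a sketch and assumes `name` is stored as a factor; since R 4.0 `data.frame()` keeps strings as character by default, so `stringsAsFactors = TRUE` is passed explicitly here (that argument is not in the original example).

```r
# Assumption: `name` is a factor, requested explicitly for R >= 4.0.
df <- data.frame(name = c("A", "B", "C"), var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5), stringsAsFactors = TRUE)

# In the full data frame every level has a row, so no level is unused,
# even though var1 is NA for B.
table(df$name)

# Dropping the row where var1 is NA keeps all three levels on `name`;
# B no longer has any rows, so it is now an unused level.
sub <- df[!is.na(df$var1), ]
levels(sub$name)

# droplevels() discards levels that have no remaining rows.
sub <- droplevels(sub)
levels(sub$name)
```

For plotting, the `droplevels()` step should not be needed: discrete scales in ggplot2 drop unused levels by default (`drop = TRUE`), which is why subsetting alone removes the gap for B.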
6

Notice that, when plotting, you're using only two columns of your data frame. Rather than passing the whole data frame (call it x), you could take the relevant columns, x[,c("name", "var1")], apply na.omit to remove the unwanted rows (as Gavin Simpson suggests), na.omit(x[,c("name", "var1")]), and then plot the result.

My R/ggplot is quite rusty, and I realise that there are probably cleaner ways to achieve this.

Tilo Wiklund
  • Didn't realize I could do that from within ggplot, but it now seems obvious. This definitely works. It would be great if there was an equivalent to `drop=T` or `scale="free"`, as I'll have to tweak all of my plot functions this way. Shouldn't be too bad, and I'll just use `dat[, c(1,n)]` so that I can iterate through each column without much hassle. Thanks! – Hendy Jul 09 '12 at 21:37
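
Iterating over the response columns, as described in the comment above, might look roughly like this. It is a sketch rather than a tested recipe for the real data: `rows_for` and `plot_var` are hypothetical helper names, the columns are selected by name instead of by position, and `geom_col()` stands in for `geom_bar()` because current ggplot2 versions count rows in `geom_bar()` unless told otherwise.

```r
library(ggplot2)

df <- data.frame(name = c("A", "B", "C"), var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5), var3 = c(NA, 6, 7))

# Hypothetical helper: keep `name` plus one response column, then drop
# the rows where that response is NA.
rows_for <- function(data, yvar) {
  na.omit(data[, c("name", yvar)])
}

# Hypothetical helper: bar plot of one response column against `name`.
# .data[[yvar]] looks the column up by its name given as a string.
plot_var <- function(data, yvar) {
  ggplot(rows_for(data, yvar), aes(x = name, y = .data[[yvar]])) +
    geom_col()
}

# One plot per response column, stored in a list named after the columns.
vars <- setdiff(names(df), "name")
plots <- lapply(vars, function(v) plot_var(df, v))
names(plots) <- vars
```

Each element of `plots` can then be printed or passed to `ggsave()` with a file name built from the column name, so only the column name changes between plots.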
2

A lot of time has passed since this question was originally asked. In 2021 if I was handling this I would use something like:

library(ggplot2)
library(tidyr)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))

df %>% 
  drop_na(var1) %>% 
  ggplot(aes(name, var1)) +
  geom_col()

Created on 2021-12-03 by the reprex package (v2.0.1)
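
If all of the responses are wanted at once, the same idea seems to extend to long format. The following is a sketch under assumptions, not part of the reprex above: it uses the revised data set with `var3` from the question, and the column names `var` and `value` are chosen here for illustration.

```r
library(ggplot2)
library(tidyr)

df <- data.frame(name = c("A", "B", "C"), var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5), var3 = c(NA, 6, 7))

# Reshape to one row per (name, var) pair; values_drop_na = TRUE leaves
# out the pairs that have no measurement.
long <- pivot_longer(df, cols = -name, names_to = "var",
                     values_to = "value", values_drop_na = TRUE)

# One panel per response. A free x scale lets each panel show only the
# names that actually have a value for that response.
p <- ggplot(long, aes(name, value)) +
  geom_col() +
  facet_wrap(~var, scales = "free_x")
```

With the data in this shape there is simply no row for `name = B`, `var = var1`, so nothing has to be filtered per plot.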

John-Henry
  • This is awesome! I think there are *a lot* of questions on SO like this. I just did something similar to another question, where all of the answers now seemed fiddly and tedious vs. `dplyr`. Thanks for taking the time to `modernize()` :) – Hendy Dec 04 '21 at 18:55