
I'm trying to plot a boxplot in R using ggplot2.

Here's my code with sample data:

df = structure(list(Closeness = c(0.0919540229885057, 0.0950259836674091, 0.0957367240089753, 0.0960240060015004, 0.0901408450704225, 0.0970432145564822, 0.0939794419970631, 0.0943952802359882, 0.0921526277897768, 0.0914285714285714, 0.0933625091174325, 0.0953090096798213, 0.0917562724014337, 0.0960960960960961, 0.0937728937728938, 0.0909090909090909, NA, 0.0946045824094605, 0.0864280891289669, 0.0879120879120879, 0.0905233380480905, 0.100313479623824, 0.0993017843289372, 0.0942562592047128, 0.0950965824665676, 0.0907801418439716, NA, NA, 0.0950965824665676, 0.0913633119200571, NA, 0.0926864590876177, NA, 0.0948148148148148, 0.0958801498127341, 0.0945347119645495, 0.0931586608442504, 0.090014064697609, 0.0968229954614221, 0.0963855421686747, 0.0926193921852388, 0.0919540229885057, 0.0947446336047372, 0.0917562724014337, 0.0905874026893135, 0.0950965824665676, NA, 0.0926193921852388, 0.0900774102744546, 0.0977845683728037), Var1 = c("Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group", "Group"), Var2 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "A", "A", "K", "K", "G", "G", "N", "N", "O", "O", "A", "P", "P", "P", "Q", "Q", "Q", "Q", "A", "A", "A", "A", "R", "R", "R", "R", "S", "S", "S", "S", "L", "L", "L", "L", "L", "L", "L")), .Names = c("Closeness", "Var1", "Var2"), row.names = c(NA, 50L), class = "data.frame")

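library(ggplot2)
library(reshape2)  # assumed source of melt(); it accepts variable.name/value.name
tmp <- data.frame(df, check.names=TRUE)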
tmp <- melt(tmp, id="Closeness", variable.name="Var1", value.name="Var2")
tmp$Var1 <- gsub("(.*)\\.[0-9]", "\\1", tmp$Var1)
df <- subset(tmp, Var2!="")

df_g = subset(df, Var1=="Group")
df_c = subset(df, Var1=="Cat")

ggplot(df_c, aes(x = df_g$Var2, y = df_g$Closeness), position = "dodge") + # geom_point() +
geom_boxplot(outlier.size = 1.5) #+ geom_jitter(position=position_jitter(width=.2, height=0))

Which produces this (with the full dataset):

[image: boxplot of Closeness by category]

Now, I have two problems:

  1. I'd like the categories (A, B, C, D) to be ordered by descending mean;
  2. Some categories only have one sample (i.e. B, D, and E). I'd like to remove them before plotting.

Is this possible using ggplot2? If so, how should I proceed?

Lucien S.

1 Answer


Normally I'd comment and close as duplicate of, e.g.,

or pretty much anything that comes up if you search Stack Overflow for "ggplot2 order". If you want boxplot-specific examples (the method is the same), see

Or even this one which you asked less than 2 weeks ago. Different geom, same principle.

But you also have some other issues, one of which is using data$column inside aes(), which is a bit of a pet peeve of mine, so let's address that too.

Don't use data$column inside aes()! It means you're not using the data argument correctly. Related: it's not clear at all why you start the plot with the empty data frame df_c, when df_g has everything you need:

ggplot(df_g, aes(x = Var2, y = Closeness), position = "dodge") + 
    geom_boxplot(outlier.size = 1.5) 

Correctly using the data argument, and not specifying data$column inside aes(), will make sure your plot works right in all cases. If you use $ inside aes(), facets and other complex features probably will not work. If you need to use multiple data frames in one plot, do it at the layer level (e.g., geom_point(data = other_data, aes(x = x_var, y = y_var))). You still don't need to use $ inside aes().
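As a minimal sketch of the layer-level approach, assuming two made-up data frames (the names `main_data`, `other_data`, `grp`, and `val` are hypothetical and not from the question), the second data frame is handed to a single layer while every aes() still uses bare column names:

```r
# Hypothetical data: names and values are invented for illustration only.
main_data  <- data.frame(grp = rep(c("A", "B"), each = 4),
                         val = c(1, 2, 3, 4, 2, 4, 6, 8))
other_data <- data.frame(grp = c("A", "B"), val = c(2.5, 5))

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(main_data, aes(x = grp, y = val)) +  # bare column names, no $
    geom_boxplot() +
    # other_data is supplied to this layer only; aes() still uses bare names
    geom_point(data = other_data, aes(x = grp, y = val),
               colour = "red", size = 3)
  print(p)
}
```

Because each layer knows which data frame its columns come from, features such as faceting can still look the columns up by name.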

As for your two stated problems, they are both solved by editing your data. ggplot is very good at plotting data; you just need to make your data look like what you want to plot.

I'd like the categories (A, B, C, D) to be ordered by descending mean;

Order the factor in your data!

df_g$Var2 = with(df_g, reorder(x = Var2, X = Closeness, FUN = function(x) -mean(x, na.rm = TRUE)))
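To see what that reorder() call does, here is a small sketch on made-up data (the `toy` data frame is hypothetical, not the question's data). reorder() sorts the levels by increasing score, so negating the mean yields descending order of the mean:

```r
# Hypothetical toy data, not the question's data set.
toy <- data.frame(
  Var2      = c("A", "A", "B", "B", "C", "C"),
  Closeness = c(1, 3, 10, NA, 4, 6)
)

# Group means, ignoring NAs: A = 2, B = 10, C = 5
group_means <- tapply(toy$Closeness, toy$Var2, mean, na.rm = TRUE)

# Negate the mean so the largest mean gets the smallest score
# and therefore becomes the first level.
toy$Var2 <- with(toy, reorder(Var2, Closeness,
                              FUN = function(x) -mean(x, na.rm = TRUE)))

levels(toy$Var2)  # "B" "C" "A"
```

ggplot2 draws a discrete axis in level order, so the box for B would come first, then C, then A.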

Some categories only have one sample (i.e. B, D, and E). I'd like to remove them before plotting.

Okay, remove them! You could wholly remove them from your data or just subset the data that you give to the plot:

more_than_one = levels(df_g$Var2)[table(df_g$Var2) > 1]

ggplot(subset(df_g, Var2 %in% more_than_one), aes(Var2, Closeness)) +
    geom_boxplot()
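Note that table() counts rows, so a category whose extra rows have a missing Closeness still passes the `> 1` check. A possible variant, sketched on hypothetical toy data (`toy`, `n_obs`, `keep`, and `toy_kept` are names invented here), counts only non-missing values, drops the unused levels, and then orders what is left by descending mean:

```r
# Hypothetical toy data: "B" has one row; "D" has two rows but only one
# non-missing Closeness value.
toy <- data.frame(
  Var2      = factor(c("A", "A", "B", "C", "C", "C", "D", "D")),
  Closeness = c(1, 3, 10, 4, 6, 5, 7, NA)
)

# Count non-missing Closeness values per category.
n_obs <- tapply(!is.na(toy$Closeness), toy$Var2, sum)
keep  <- names(n_obs)[n_obs > 1]

# Keep only those categories and drop the now-empty factor levels.
toy_kept <- droplevels(subset(toy, Var2 %in% keep))

# Order the remaining categories by descending mean.
toy_kept$Var2 <- with(toy_kept, reorder(Var2, Closeness,
                                        FUN = function(x) -mean(x, na.rm = TRUE)))

levels(toy_kept$Var2)  # "C" "A"

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(toy_kept, aes(Var2, Closeness)) + geom_boxplot())
}
```

Filtering before reordering means the means are computed only for the categories that end up on the plot.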
Gregor Thomas