41

I know it is preferable for variable names not to contain spaces. However, I need publication-quality charts, so axes and legends must have properly formatted labels, i.e. with spaces. For example, in development I might have variables called "Pct.On.OAC" and "Age.Group", but in my final plot I need "% on OAC" and "Age Group" to appear:

'data.frame':   22 obs. of  3 variables:
 $ % on OAC           : Factor w/ 11 levels "0","0.1-9.9",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Age Group          : Factor w/ 2 levels "Aged 80 and over",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Number of Practices: int  47 5 33 98 287 543 516 222 67 14 ...

But when I try to plot these:

ggplot(dt.m, aes(x=`% on OAC`,y=`Number of Practices`, fill=`Age Group`)) +
    geom_bar()

no problem with that. But when I add a facet:

ggplot(dt.m, aes(x=`% on OAC`,y=`Number of Practices`, fill=`Age Group`)) +
    geom_bar() +
    facet_grid(`Age Group`~ .) 

I get the error: Error in `[.data.frame`(base, names(rows)) : undefined columns selected

If I change "Age Group" to "Age.Group" then it works fine, but as I said, I don't want the dot to appear in the legend title.

So my questions are:

  1. Is there a workaround for the problem with the facet?
  2. Is there a better general approach to dealing with spaces (and other characters) in variable names when I want the final plot to include them? I suppose I can manually override them, but that seems like a lot of faffing around.
Robert Long

4 Answers

29

You asked "Is there a better general approach to dealing with the problem of spaces (and other characters) in variable names" and yes there are a few:

  • Just don't use them as things will break as you experienced here
  • Use the make.names() function to create safe names; this is used by R too to create identifiers (eg by using underscores for spaces etc)
  • If you must, protect the unsafe identifiers with backticks.

Example for the last two points:

R> myvec <- list("foo"=3.14, "some bar"=2.22)
R> myvec$`some bar` * 2
[1] 4.44
R> make.names(names(myvec))
[1] "foo"      "some.bar"
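A minimal sketch of how the first two points might be combined for plotting: keep safe names on the data and store the display labels separately. The data values and the `display_labels` name are illustrative assumptions, not taken from the question.

```r
# Hypothetical data frame whose columns carry display-style names;
# check.names = FALSE stops data.frame() from converting them.
dat <- data.frame(`Age Group` = c("Aged 80 and over", "Aged under 80"),
                  `Number of Practices` = c(47L, 5L),
                  check.names = FALSE)

# Remember the display names, then switch to syntactically valid ones.
display_labels <- names(dat)
names(dat) <- make.names(display_labels)
names(display_labels) <- names(dat)

names(dat)
# "Age.Group" "Number.of.Practices"

# Look up the label to show on an axis or legend.
display_labels[["Age.Group"]]
# "Age Group"
```

The lookup vector can then be passed to whatever sets the axis or legend titles, so the safe names are used everywhere else.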
Dirk Eddelbuettel
  • yes, but in this case (because ggplot does some extra evaluation), protecting with backticks doesn't work, so we're back to your point #1 ... – Ben Bolker Oct 05 '12 at 12:34
  • Sure, as one cannot (easily) alter all other packages. There is a reason I ranked them the way I did. Backticks is the last resort. – Dirk Eddelbuettel Oct 05 '12 at 13:13
  • Not sure if this was recently updated, but underscores are NOT used for spaces anymore in R version 4.2.2, instead it replaces spaces with period '.' – Raleigh L. Nov 08 '22 at 00:36
  • 1
    @RaleighL. that is what my answer from ten years ago shows: `"some bar"` becomes `"some.bar"`. – Dirk Eddelbuettel Nov 08 '22 at 01:47
  • Ah okay, I was addressing this part: `(eg by using underscores for spaces etc)` – Raleigh L. Nov 08 '22 at 05:02
21

This is a "bug" in the package ggplot2 that comes from the fact that the function as.data.frame() in the internal ggplot2 function quoted_df converts the names to syntactically valid names. These syntactically valid names cannot be found in the original dataframe, hence the error.
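The name conversion described here can be reproduced in base R alone. The sketch below does not call ggplot2's internals; it only illustrates, with a made-up one-element list, how as.data.frame() rewrites a name containing a space so that it no longer matches the original.

```r
# A list with a non-syntactic name, standing in for the faceting variable.
facet_vars <- list(`Age Group` = c("Aged 80 and over", "Aged under 80"))

# as.data.frame() checks names by default and rewrites the space as a dot.
converted <- as.data.frame(facet_vars)
names(converted)
# "Age.Group"

# The rewritten name is absent from the original names, which is the kind
# of mismatch that leads to "undefined columns selected".
names(converted) %in% names(facet_vars)
# FALSE
```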

To remind you:

A syntactically valid name consists of letters, numbers and the dot or underline characters, and starts with a letter or the dot (but the dot cannot be followed by a number).

There's a reason for that. There's also a reason why ggplot allows you to set labels using labs, eg using the following dummy dataset with valid names:

X <- data.frame(
  PonOAC = rep(c("a", "b", "c", "d"), 2),
  AgeGroup = rep(c("over 80", "under 80"), each = 4),
  NumberofPractices = rpois(8, 70)
)

You can use labs at the end to make this code work

ggplot(X, aes(x = PonOAC, y = NumberofPractices, fill = AgeGroup)) +
  geom_bar(stat = "identity") +  # plot the y values as given rather than counting rows
  facet_grid(AgeGroup ~ .) +
  labs(x = "% on OAC", y = "Number of Practices", fill = "Age Group")

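This produces:

[Figure: the resulting faceted bar chart, with axes labelled "% on OAC" and "Number of Practices" and a legend titled "Age Group"]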
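For readers on a newer ggplot2, a possible workaround for the facet itself is sketched below. It assumes ggplot2 3.0.0 or later, where facet_grid() accepts variables wrapped in vars(); this postdates the answer above, and the data frame `dt_demo` is invented for illustration.

```r
library(ggplot2)

# Invented data with the same display-style column names as the question.
dt_demo <- data.frame(
  `% on OAC` = rep(c("0", "0.1-9.9", "10-19.9"), times = 2),
  `Age Group` = rep(c("Aged 80 and over", "Aged under 80"), each = 3),
  `Number of Practices` = c(47, 5, 33, 98, 287, 543),
  check.names = FALSE  # keep the spaces in the column names
)

# geom_col() draws bars from the y values as given;
# vars() lets the facet refer to the backticked column name.
p <- ggplot(dt_demo, aes(x = `% on OAC`, y = `Number of Practices`,
                         fill = `Age Group`)) +
  geom_col() +
  facet_grid(rows = vars(`Age Group`))

print(p)
```

Whether to rely on this or on syntactic names plus labs() is a judgment call; the labs() route works on older versions as well.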

Joris Meys
  • 2
    PS : As @DirkEddelbuettel points out, afaik the function `as.data.frame` uses the function `make.names()` internally to "correct" those names (i.e. create valid identifiers). – Joris Meys Oct 05 '12 at 11:49
3

A simple solution for multi-word column names is to separate the words with an underscore character. This has some advantages over other conventions:

  • An underscore in a column name is valid
  • An underscore separates the words for readability
  • Camelcase can be tricky to read (consider s vs S and w vs W - similar letters can cause confusion, which can be problematic since R is case sensitive)
  • Using a period (.) in a column name is valid but often not ideal from a readability perspective, especially for anyone from languages other than R who may mistake the period for a method call (e.g. data.test could be a column name in R, but could look like the .test method is being called on the object data if someone is used to reading other languages, like ruby or python)
  • Using spaces in column names is valid, but when referencing those columns, it will be necessary to surround the column name with backticks i.e. the ` symbol
    • e.g. df$`Sepal Length`, for a data frame df that has a column named "Sepal Length"

TL;DR: Use the underscore to separate words in column names and you shouldn't have any problems. Avoid spaces in column names, and if your data already has some, surround the full column name with backticks ` when referring to it in functions.
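One way the underscore convention could feed into plot labels is sketched below: because underscores map cleanly to spaces, display labels can be derived from the column names with gsub(). The column names are illustrative, and labels containing other characters (such as "%") would still need to be set by hand.

```r
# Illustrative column names following the underscore convention.
col_names <- c("Age_Group", "Number_of_Practices")

# Replace every underscore with a space to obtain display labels.
axis_labels <- gsub("_", " ", col_names, fixed = TRUE)
names(axis_labels) <- col_names

axis_labels[["Age_Group"]]
# "Age Group"
```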

stevec
-1
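library(data.table)

# inv01 is a data.table whose column names contain spaces
names(inv01)

[1] "INV_YEAR"  "TREE_NO"   "DBH 2019"  "HT 2019" 

# Backticks let the columns be referenced inside [ ]; list() gives them safe new names
inv01tmp <- inv01[, list(DBH = `DBH 2019`, HT = `HT 2019`)]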


Andrew Taylor
LUIS