How does R handle Unicode / UTF-8?

Question

If I write

`Δ` <- function(a,b)   (a-b)/a

then I can include U+0394 so long as it's enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "�".) So apparently R parses UTF-8 or Unicode or something like that. The assignment goes well, and so does the evaluation of, e.g.,

`Δ`(1:5, 9:13)

I can also evaluate Δ(1:5, 9:13).

Finally, if I define something like winsorise <- function(x, λ=.05) { ... } then λ (U+03BB) doesn't need to be "introduced to" R with a backtick. I can then call winsorise(data, .1) with no problems.

The only mention of Unicode I can find in R's documentation is over my head. Could someone who understands it better explain what's going on "under the hood" when R needs the ` to understand assignment to ♔, but can parse ♔(a,b,c) once assigned?

In the R-internals: What R users think of as variables or objects are symbols which are bound to a value. I think the CHARSXP section you linked to is the value, and you are actually interested in the rules for symbols. That said, I've worked on R code written in Chinese, so I'd expect delta to work. — Neal Fultz, Feb 12 '15 at 17:30
What version of R are you using/what OS/what locale? I get "Error: \uxxxx sequences not supported inside backticks (line 1)" when assigning a function to `Δ`. (Tested on today's R-devel and 3.1.0 under Win 7, English UK locale.) — Richie Cotton, Feb 12 '15 at 17:37
What version of R are you using that `Δ <- function(a,b) (a-b)/a` fails? And when you say it "fails", what do you mean? Do you get a syntax error? It worked for me on `R version 3.1.0, x86_64-apple-darwin10.8.0 (64-bit)` locale en_US.UTF-8 — MrFlick, Feb 12 '15 at 17:37
@MrFlick @RichieCotton 3.1.2 "Pumpkin Helmet", says `Error: unexpected input in "�"`. — isomorphismes, Feb 12 '15 at 18:44
Δ doesn't work for me on `R version 3.1.1 (2014-07-10) Platform: x86_64-w64-mingw32/x64 (64-bit)` — Vlo, Feb 12 '15 at 18:47
@RichieCotton MrFlick Sorry, too late to edit that comment. Also using 32-bit Ubuntu, Irish English locale. Seems like a lot of the people getting errors are using 64-bit, I wonder if that's it? — isomorphismes, Feb 12 '15 at 19:34

drammock · Answer 1 · 2015-02-19T01:49:32.427

I can't speak to what's going on under the hood regarding the function calls vs. function arguments, but this email from Prof. Ripley from 2008 may shed some light (excerpt below):

R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]
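
To make the "declared encoding" idea concrete, here is a minimal sketch using base R only. The variable names are illustrative, and what `Encoding()` and `enc2native()` return depends on your locale, so no fixed output is claimed for those two calls.

```r
# Greek capital delta, written with an ASCII-only escape so the source
# file itself contains no non-ASCII bytes.
x <- "\u0394"

# The encoding mark R has recorded for this string (locale-dependent).
Encoding(x)

# The underlying UTF-8 bytes of the character: ce 94.
utf8_bytes <- charToRaw(enc2utf8(x))

# The Unicode code point as an integer: 916, i.e. 0x394.
code_point <- utf8ToInt(enc2utf8(x))

# Translation to the native encoding. If the locale cannot represent
# the character, the result may be a substitute such as "<U+0394>".
native <- enc2native(x)
```

The translation step in the last line is the one the excerpt describes: it is harmless in a UTF-8 locale and lossy elsewhere.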

The reason R does this translation (on Windows at least) is mentioned in the documentation that the OP linked to:

Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.

The R documentation for ?Quotes explains how you can sometimes use out-of-locale characters anyway (emphasis added):

Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.

The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.

Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.
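
As a rough illustration of the syntactic-name rules quoted above, the sketch below uses `make.names()` as a test. The helper `is_syntactic` is a hypothetical name introduced here, not a base R function, and only ASCII inputs are used because the result for non-ASCII letters depends on the locale.

```r
# A name is syntactic if make.names() has nothing to fix.
# (is_syntactic is an illustrative helper, not part of base R.)
is_syntactic <- function(x) make.names(x) == x

checks <- is_syntactic(c("delta", "my var", "2x", ".2x", "_x", "if"))
# Only "delta" passes: the others contain a space, start with a digit,
# start with a period followed by a digit, start with an underscore,
# or are a reserved word.

# A non-syntactic name is still usable when quoted with backticks.
`my var` <- 10
total <- `my var` + 1
```

In a locale where Δ is not considered a letter, Δ falls into the same category as `my var`: usable only when quoted.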

There is another way to get at such characters: the Unicode escape sequence (like \u0394 for Δ). This is usually a bad idea if you're using that character for anything other than text on a plot, i.e., don't do this for variable or function names. Compare this quote from the R 2.7 release notes, when much of the current UTF-8 support was added:

If a string presented to the parser contains a \uxxxx escape invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for e.g. plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a 'male sign'. Such strings will be printed as e.g. <U+2642> except in the Rgui console (see below).
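
A small sketch of the two strings named in that release note, assuming nothing beyond base R. The variable names are illustrative, and the plotting call is left as a comment because it needs a graphics device.

```r
beta <- "\u03b2"   # Greek small beta
male <- "\u2642"   # male sign

# Each is a single character...
n_chars <- nchar(c(beta, male), type = "chars")

# ...but takes two and three bytes respectively in UTF-8.
n_bytes <- nchar(enc2utf8(c(beta, male)), type = "bytes")

# Recover the code point in the familiar U+XXXX notation.
label <- sprintf("U+%04X", utf8ToInt(enc2utf8(male)))

# Typical safe use: text on a plot rather than an identifier, e.g.
# plot(1, main = beta)
```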

I think this addresses most of your questions, though I don't know why there is a difference between the function name and function argument examples you gave; hopefully someone more knowledgeable can chime in on that. FYI, on Linux all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation need occur):

Δ <- function(a,b) (a-b)/a         # no error
`Δ` <- function(a,b) (a-b)/a       # no error
"Δ" <- function(a,b) (a-b)/a       # no error
"\u0394" <- function(a,b) (a-b)/a  # no error
Δ(1:5, 9:13)        # -8.00 -4.00 -2.67 -2.00 -1.60
`Δ`(1:5, 9:13)      # same
"Δ"(1:5, 9:13)      # same
"\u0394"(1:5, 9:13) # same

sessionInfo()

# R version 3.1.2 (2014-10-31)
# Platform: x86_64-pc-linux-gnu (64-bit)

# locale:
# LC_CTYPE=en_US.UTF-8    LC_NUMERIC=C                LC_TIME=en_US.UTF-8
# LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8     LC_MESSAGES=en_US.UTF-8
# LC_PAPER=en_US.UTF-8    LC_NAME=C                   LC_ADDRESS=C
# LC_TELEPHONE=C          LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C

# attached base packages:
# stats  graphics  grDevices  utils  datasets  methods  base
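
If you want to check whether your own session is in the "no translation needed" situation, a quick sketch with base R is below. The interpretation in the comments is an assumption based on the discussion above, not a guarantee of parser behaviour.

```r
# Locale information for the running session.
info <- l10n_info()

# TRUE when the session's character set is UTF-8, in which case
# UTF-8 input needs no translation to the native encoding.
is_utf8 <- info[["UTF-8"]]

# The character-type locale setting, e.g. "en_US.UTF-8" on the
# Linux session shown above.
ctype <- Sys.getlocale("LC_CTYPE")
```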

Thanks for doing all this research. I had no idea there was such a significant difference between Windows and Linux as far as UTF-8 goes, but that explains the people who commented with failures. (And I'm glad to know it's less likely 32-bit-ness than Windows-ness causing their errors.) — isomorphismes, Feb 18 '15 at 16:53
Thanks @isomorphismes. I've just edited it slightly to make it more coherent, but the same basic information is there. In sum: any Unicode-related weirdness is almost always the fault of Windows, but it's nothing to do with 32 vs 64 bit. It's all about UTF-8 vs. UCS-2 (FYI, if you do further reading: UCS-2 is the fixed-width predecessor of UTF-16, and modern Windows actually works with UTF-16LE). — drammock, Feb 19 '15 at 01:54
@isomorphismes Under R 3.1.2 (Win 7 64 bit) `get("Δ")(1,2)` will execute without error — mnel, Feb 19 '15 at 02:02
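
Building on that last comment, one way to sidestep the parser is to pass the name as a string to `assign()` and `get()`. The sketch below uses the ASCII-only escape so the source file has no non-ASCII bytes; `delta_fn` and `res` are illustrative names, and whether the name displays as Δ afterwards still depends on the locale.

```r
# Bind the function to a name given as a string, never as a bare identifier.
assign("\u0394", function(a, b) (a - b) / a)

# Look it up the same way and call it.
delta_fn <- get("\u0394")
res <- delta_fn(1:5, 9:13)
# -8.00 -4.00 -2.67 -2.00 -1.60 (rounded), matching the Linux output above
```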

score 3 · Answer 2 · answered Feb 12 '15 at 17:58

For the record, under R-devel (2015-02-11 r67792), Win 7, English UK locale, I see:

options(encoding = "UTF-8")

`Δ` <- function(a,b) (a-b)/a 
## Error: \uxxxx sequences not supported inside backticks (line 1)

Δ <- function(a,b) (a-b)/a
## Error: unexpected input in "\"

"Δ" <- function(a,b) (a-b)/a      # OK

`Δ`(1:5, 9:13)
## Error: \uxxxx sequences not supported inside backticks (line 1)

Δ(1:5, 9:13)
## Error: unexpected input in "\"

"Δ"(1:5, 9:13)
## Error: could not find function "Î”"
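
The "Î”" in that last error looks like the two UTF-8 bytes of Δ being read as Windows-1252. That is my reading of the output, not something confirmed in the thread. The sketch below reproduces the effect, assuming your platform's iconv recognises the name "CP1252".

```r
# The UTF-8 encoding of U+0394 is the two bytes ce 94.
utf8_bytes <- charToRaw(enc2utf8("\u0394"))

# Reinterpret those same bytes as Windows-1252 and convert to UTF-8.
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")

# Two characters come out: U+00CE (I with circumflex) and
# U+201D (right double quotation mark).
misread_points <- utf8ToInt(misread)
```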

OK interesting. I wouldn't have expected this to differ across versions. — isomorphismes, Feb 12 '15 at 18:45
