
The following, when copied and pasted directly into R, works fine:

> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."

However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:

> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") : 
  C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2: 
  ^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
  invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'

Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.

> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
[1] tools_2.12.1

and

> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252
Bernd Elkemann
Tony Breyal
  • Well, it seems to work well here. I run Linux with a UTF-8 locale. Maybe the problem comes from the locale on your system. Did you try changing it to a UTF-8 one? – juba Feb 17 '11 at 19:05
  • Works on MacOS 10.6.6 as well. – ayman Feb 17 '11 at 22:31
  • @juba How would I go about changing R on Windows to a UTF-8 locale? – Tony Breyal Feb 18 '11 at 11:19
  • Well, my knowledge of Windows is quite limited, but maybe you can take a look at the `Sys.setlocale` R function, and find some information in the R Installation and Administration guide: http://cran.r-project.org/doc/manuals/R-admin.html#Locales – juba Feb 18 '11 at 11:45
  • @juba - many thanks, but even after looking at that otherwise rather useful document, I can't see how to set it to a UTF-8 locale. – Tony Breyal Feb 21 '11 at 12:14
  • How did you create the file, and how do you know it's really in UTF-8 format? Do you know the characters in that file are correctly encoded? – hadley Feb 21 '11 at 16:55
  • @hadley The file was created in Notepad and saved by changing the encoding from ANSI to UTF-8. – Tony Breyal Feb 21 '11 at 19:28
  • @hadley I'm sure this is an R on Windows thing; it will work fine on Linux. The file I've been working with (you can see it in my answer) just came from copying some sample Unicode text from a website offering such a thing. These text editors (Notepad, Notepad2, Notepad++) can all encode UTF-8 easily enough. All this talk of locales seems bizarre to me (I'm just a Windows developer). On Windows you no longer worry about locales because we've stopped using the old ANSI API calls. Text is UTF-16LE and it all just works. I can't understand why there is a problem! – David Heffernan Feb 21 '11 at 20:19

7 Answers


On R for Windows, source() runs into problems with any UTF-8 characters that can't be represented in the current locale (the ANSI code page, in Windows-speak). Unfortunately, Windows doesn't offer UTF-8 as an ANSI code page: ANSI code pages are limited to one- or two-byte-per-character encodings, not variable-length encodings like UTF-8.

This doesn't seem to be a fundamental, unsolvable problem; there's just something wrong with the source() function. You can get most of the way there by doing this instead:

eval(parse(filename, encoding="UTF-8"))

This'll work almost exactly like source() with default arguments, but won't let you use echo=TRUE, print.eval=TRUE, etc.
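As a rough, self-contained sketch of this workaround: the temporary file and the function name greeting below are made up for illustration, and the Chinese text is written with \u escapes so that the snippet itself stays pure ASCII.

```r
# Hypothetical demo: write a small UTF-8 script to a temporary file.
path <- tempfile(fileext = ".R")
code <- "greeting <- function() \"R\u540c\u65f6\u4e5f\u88ab\""

# useBytes = TRUE writes the UTF-8 bytes unchanged instead of converting
# them to the native code page first.
writeLines(enc2utf8(code), path, useBytes = TRUE)

# parse() reads the file as-is and marks its strings as UTF-8;
# eval() then runs the parsed expressions in the global environment.
eval(parse(path, encoding = "UTF-8"), envir = globalenv())

result <- greeting()
utf8ToInt(enc2utf8(result))  # code points of the returned string
```

Whether the string also displays correctly in the console still depends on the console and its fonts; the sketch only checks that the characters survive being read from the file.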

ah bon
Joe Cheng
  • I confirm that this works. `source()` requires setting `Sys.setlocale()` all along the file. `eval` does the job without this requirement. – Anton Tarasenko Dec 15 '13 at 07:05
  • `source` forwards the `encoding` argument to `file`, which, in turn, converts the textual input in memory to whatever locale encoding is specified (and fails) – this seems to be the culprit. `parse`, by contrast, doesn’t do this; it reads the file as-is and just marks the bytes in memory with the correct encoding. – I’m not entirely sure what this tells us, except that R’s internal handling of encodings is messy (we already knew that), and should be fixed, backwards compatibility be damned. – Konrad Rudolph Jun 26 '14 at 17:53
  • Is this still true in the latest R releases, where UCRT is used to deal with encoding on Windows? – llrs Aug 03 '22 at 09:21

We discussed this at length in the comments on my previous answer, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It then works both with input from the R console (see the screenshot in the comments) and with input from a file, as this screenshot shows:

The file "myfile.r" contains:

russian <- function() print ("Американские с...");

The console contains:

> source("myfile.r", encoding="utf-8")
Error in source(".....
> Sys.setlocale("LC_CTYPE","ru")
[1] "Russian_Russia.1251"
> russian()
[1] "Американские с..."

Note that sourcing the file fails, and the error points to the same character as in the original poster's error (the one after "R). I cannot try this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
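If you would rather not leave the session in a different locale, the switch can be wrapped so that the previous setting is restored afterwards. This is only a sketch: source_with_locale is a hypothetical helper name, and locale names such as "ru" or "chinese" are the Windows spellings used in this thread and may not exist on other systems.

```r
# Hypothetical helper: source a file under a temporary LC_CTYPE locale.
source_with_locale <- function(file, locale, encoding = "UTF-8") {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous locale even if source() fails.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)

  # Sys.setlocale() returns "" (with a warning) if the OS refuses the request.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (!nzchar(new)) stop("locale not available: ", locale)

  source(file, encoding = encoding)
}

# Example call, assuming the file from this answer exists on Windows:
# source_with_locale("myfile.r", "ru")
```

The explicit check matters because a refused request (the "cannot be honored" message) otherwise only produces a warning, and source() would then run under the old locale.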

Gregor Thomas
Bernd Elkemann
  • Many thanks, this worked! I used Sys.setlocale("LC_CTYPE","chinese") – Tony Breyal Feb 21 '11 at 14:46
  • Anytime, sir. ("chinese", not "Chinese"; interesting how inconsistent they are. Good that you found out.) – Bernd Elkemann Feb 21 '11 at 15:00
  • How do you load a file that contains multiple languages? Something is wrong in R! – David Heffernan Feb 21 '11 at 17:01
  • You just switch the locale multiple times inside that file. I'm not sure the problem is with R; some commenters said that it's fine on Linux (without locale switching). It may be R, but it may be the Windows API (widechar instead of UTF-8) or a combination thereof. – Bernd Elkemann Feb 21 '11 at 18:49
  • @David @eznme Just saw this on the official R-help list, in which Prof Ripley says something about UTF-8 locales on Windows: http://goo.gl/cUZCm – Tony Breyal Feb 21 '11 at 19:17
  • @Tony Prof. Ripley is talking out of his hat! Windows supports UTF-8 just fine. Windows has supported Unicode since 1991 and the reason it uses UTF-16 rather than UTF-8 as on Linux is that it supported Unicode before UTF-8 was even invented! My Windows app eats all these characters for breakfast. Locales should be irrelevant when you specify an encoding. I'm fingering `iconv` as the culprit here, but I'm afraid that if Prof. Ripley is taking that attitude then R on Windows has little hope of ever supporting Unicode properly. – David Heffernan Feb 21 '11 at 19:28
  • @eznme There just should be no need for locales. That might be how it's done on Linux but it makes no sense in Windows. You just use the WideChar versions of all the API functions, hold the text as LPWSTR, and convert to different encodings at the boundaries (file import/export). It's not that difficult, but I understand that it becomes more difficult if you want to support Linux and Windows from a single codebase! – David Heffernan Feb 21 '11 at 19:44
  • @eznme Of course I can't get this locale thing to go because I can't select the ru locale on my machine. What a mess! – David Heffernan Feb 21 '11 at 19:47
  • The solution doesn't work for me. If I have this in my R source: `boxplot(weight~Diet,data=ChickWeight,subset = Time ==21,col = "yellow", main="Gewicht van kuikens in gram op dag 21 bij verschillende diëten", xlab="dieet", ylab="gewicht in gram", sub="bron:package datasets in R")` I still get `INCOMPLETE_STRING`. Also, is there a way to make RStudio source in UTF-8 by default? – retorquere Feb 20 '17 at 14:02

On Windows, I use:

source.utf8 <- function(f) {
    l <- readLines(f, encoding="UTF-8")
    eval(parse(text=l),envir=.GlobalEnv)
}

It works fine.

crow16384

I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files, containing many non-ASCII characters. But some characters cause it to fail. For example, the following

danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")

is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.

Locale seems irrelevant here. It's just a file, and you tell R what encoding it uses, so why should your locale matter?

David Heffernan

I encountered this problem when trying to source a .R file containing some Chinese characters. In my case, merely setting "LC_CTYPE" to "chinese" was not enough, but setting "LC_ALL" to "chinese" worked well.

Note that getting the encoding right is not enough when you read or write a plain text file containing non-ASCII characters in RStudio (or R?). The locale setting matters too.

PS. The command is Sys.setlocale(category = "LC_ALL", locale = "chinese"). Replace the locale value as appropriate.

ah bon
user2473519

Building on crow's answer, this solution makes RStudio's Source button work.

When you hit that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:

source <- function(f, encoding = 'UTF-8') {
    l <- readLines(f, encoding=encoding)
    eval(parse(text=l),envir=.GlobalEnv)
}

You can then add that function to your .Rprofile file, so that it is defined on startup.
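A variant that leaves base source() untouched and honours the local argument might look like the sketch below. The name source_utf8 is hypothetical, and only local and encoding are handled; other source() arguments such as echo are not supported.

```r
# Hypothetical helper: evaluate a UTF-8 script without going through source().
source_utf8 <- function(file, local = FALSE, encoding = "UTF-8") {
  # Mirror the meaning of `local` in source(): FALSE = global environment,
  # TRUE = the caller's environment, or an explicit environment.
  envir <- if (is.environment(local)) {
    local
  } else if (isTRUE(local)) {
    parent.frame()
  } else {
    globalenv()
  }

  # parse() marks strings as UTF-8 instead of re-encoding the input.
  exprs <- parse(file, encoding = encoding)

  value <- NULL
  for (expr in exprs) value <- eval(expr, envir)
  invisible(value)
}
```

Parsing the file directly also avoids the separate readLines() step, and giving the helper its own name means code that relies on the real source() keeps working.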

Domi
  • The `readLines` call is redundant. See Joe Cheng’s answer. Furthermore, when replacing the `source` function it’s a good idea to handle the remaining arguments, e.g. `local`, correctly. – Konrad Rudolph Feb 10 '21 at 20:09

On Windows, when you copy and paste a Unicode (UTF-8 encoded) string into a text control that is set to single-byte input (ASCII or similar, depending on the locale), the unknown bytes are replaced by question marks. If I take the first five characters of your string (R同时也被), paste them into, e.g., Notepad, and then save the file, it contains these bytes in hex:

52 3F 3F 3F 3F

What you have to do is find an editor that you can set to UTF-8 before pasting the text into it. The saved file (with the same five characters) then contains:

52 E5 90 8C E6 97 B6 E4 B9 9F E8 A2 AB

This will then be recognized as valid UTF-8 by R.

I used "Notepad2" to try this, but I am sure there are many others.
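The byte sequences above can also be checked from within R. The sketch below is only illustrative: has_non_ascii_utf8 is a hypothetical helper name, validUTF8() needs R 3.3.0 or later, and the sample text is written with \u escapes so that the snippet itself is ASCII.

```r
# The five sample characters from the question, written with \u escapes.
sample_text <- "R\u540c\u65f6\u4e5f\u88ab"

# UTF-8 bytes of the sample text.
charToRaw(enc2utf8(sample_text))
# [1] 52 e5 90 8c e6 97 b6 e4 b9 9f e8 a2 ab

# Hypothetical helper: TRUE if the file contains non-ASCII bytes that form
# valid UTF-8. A file saved as "R????" contains only ASCII and gives FALSE.
has_non_ascii_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  any(as.integer(bytes) > 127L) && validUTF8(rawToChar(bytes))
}
```

Calling has_non_ascii_utf8() on the saved script is a quick way to tell a file that really holds UTF-8 text from one in which the characters were already replaced by question marks.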

Bernd Elkemann
  • I just tried WinEdt (for which there is an often-used R plugin, RWinEdt) and it does not work (version 5.5). So you might want to try it with "Notepad2" first. You can also write the UTF-8 text file yourself using R's writeChar(); I think it uses the encoding you set in Sys.setlocale(). – Bernd Elkemann Feb 20 '11 at 22:20
  • It doesn't matter which text editor writes the file; they can all write it correctly. R on Windows just fails to read it. – David Heffernan Feb 20 '11 at 22:43
  • @David Heffernan The problem the original poster is having is different from yours. Yes, R can read UTF-8 files, but the way his editor is set up doesn't even create a UTF-8 file. He uses an editor that is not set to UTF-8 mode, and thus if he copies "R同时也" into it, the file becomes the bytes [52 3F 3F 3F] "R???". – Bernd Elkemann Feb 21 '11 at 09:30
  • @eznme I don't think so. OP states that the file is saved with UTF-8 encoding. I save the same file with UTF-8 encoding (or indeed UTF-16) and get the same error. The problem is with R. – David Heffernan Feb 21 '11 at 09:35
  • @eznme just take a look at my answer and try to get R to source the file with the Russian in! – David Heffernan Feb 21 '11 at 09:44
  • russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.") russian() [1] "Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями." – Bernd Elkemann Feb 21 '11 at 09:59
  • To do that use: Sys.setlocale("LC_CTYPE","ru") – Bernd Elkemann Feb 21 '11 at 09:59
  • @eznme Cheers, but as @David says, my file was originally saved in Notepad, set to UTF-8 format mode. I installed Notepad2 to try it out (quite nice, thanks for mentioning it, didn't know about it before), changed it to UTF-8 and still have the same issue. – Tony Breyal Feb 21 '11 at 11:54
  • @Tony Notepad2 is nice, Notepad++ is even nicer! – David Heffernan Feb 21 '11 at 12:13
  • @eznme @Tony What does locale have to do with anything? It's just a file read. Anyway, my machine says "OS reports request to set locale to "ru" cannot be honored". How did you get it to work? – David Heffernan Feb 21 '11 at 12:15
  • @David I actually agree with you in that the locale shouldn't matter because I'm specifically telling R to read in the file as utf-8 encoding, but I'm not an R expert and so am very willing to try different things out if they work. I get the same "cannot be honored" message as you. Also, just downloaded Notepad++ and very nice it is too! – Tony Breyal Feb 21 '11 at 12:31
  • @Tony Really, how can this be anything other than a bug in R, as I suggest in my answer? – David Heffernan Feb 21 '11 at 12:32
  • In my screenshot you can see that when I set the locale to "ru" the Russian text displays correctly; when I set it to "German" it does not. – Bernd Elkemann Feb 21 '11 at 13:03
  • @eznme I don't see you calling source on a UTF-8 file with that text in it in that screenshot. That's what doesn't work. The use of locales you are illustrating is for dealing with 8-bit character sets. A modern Unicode program uses Unicode text and so locales are only used for things like date/time/number formatting preferences. – David Heffernan Feb 21 '11 at 13:15
  • Yes. R 3.1.1 also can't do source(file, encoding="UTF-8") for Russian. – crow16384 Sep 27 '14 at 13:03