This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and built the example so that my R script produces the same results on both Windows and Linux:
(1) Get XML data in UTF-8 from the Internet
library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName
(2) Print the text downloaded from the Internet. Its encoding is UTF-8, and it displays correctly in the R console under both the Czech and the English locale on Windows:
> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
>
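Because console display depends on the locale, it can help to inspect the string without printing it. The following is a minimal sketch, assuming a hard-coded `site_name` (a stand-in for the downloaded siteName, written with \u escapes so the script itself stays ASCII); it only uses base R functions.

```r
# Stand-in for the downloaded siteName (an assumption for illustration).
site_name <- "Kory\u010dany nad p\u0159ehradou"

Encoding(site_name)           # declared encoding mark of the string
validUTF8(site_name)          # TRUE if the bytes form valid UTF-8
charToRaw(site_name)          # raw bytes; each accented letter takes two bytes
length(utf8ToInt(site_name))  # number of Unicode code points
```

If the byte count is larger than the number of code points and validUTF8() returns TRUE, the accented characters are still present in the string, whatever the console shows.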
(3) Try to create and view a data.frame. This is where the problem appears: the data.frame displays incorrectly (the diacritics are dropped) both in the RStudio viewer and in the console:
df <- data.frame(name=siteName, id=1)
df
name id
1 Korycany nad prehradou 1
(4) Try to use a matrix instead. Surprisingly, the matrix displays correctly in the R console.
m <- as.matrix(df)
View(m) #this shows incorrectly in RStudio
m #however, this shows correctly in the R console.
name id
[1,] "Koryčany nad přehradou" "1"
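To tell whether the data.frame really lost the characters or only prints them incorrectly, one option is to compare the stored string with the original at the byte level. This is a sketch under the assumption that `site_name` stands in for the downloaded value; `stringsAsFactors = FALSE` is added so the column stays a character vector in older R versions too.

```r
# Stand-in for the downloaded siteName (an assumption for illustration).
site_name <- "Kory\u010dany nad p\u0159ehradou"

# Keep the column as character rather than factor (matters in R < 4.0).
df <- data.frame(name = site_name, id = 1, stringsAsFactors = FALSE)

Encoding(df$name)                                    # encoding mark of the stored string
identical(df$name, site_name)                        # TRUE if the value is unchanged
identical(charToRaw(df$name), charToRaw(site_name))  # byte-level comparison
```

If both comparisons return TRUE, the data inside the data.frame is intact and only the printed representation is affected.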
(5) Change the locale. If I'm on Windows, set the locale to Czech. If I'm on Unix or Mac, set the locale to UTF-8. NOTE: This is not fully reliable when I run the script in RStudio; apparently RStudio doesn't always react immediately to the Sys.setlocale command.
#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")
#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale)
(7) Write the data to a text file. IMPORTANT: don't use write.csv but instead use write.table. When my locale is Czech on my English Windows, I must use fileEncoding="UTF-8" in the write.table call. Now the text file shows up correctly in Notepad++ and also in Excel.
write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
(8) Set the locale back to original
Sys.setlocale("LC_CTYPE", original.locale)
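The switch, write, and restore steps can also be wrapped in one function so the original locale comes back even if writing fails. This is only a sketch: `write_utf8_table` is a hypothetical helper name, not something from base R or any package, and the locale names are the same platform-dependent assumptions used above.

```r
# Hypothetical helper: write a table as UTF-8 while temporarily switching
# LC_CTYPE, then restore the previous locale.
write_utf8_table <- function(x, path, ...) {
  original_locale <- Sys.getlocale(category = "LC_CTYPE")
  # Restore the locale when the function exits, even after an error.
  on.exit(Sys.setlocale("LC_CTYPE", original_locale), add = TRUE)

  # Locale names are assumptions; they may not be installed everywhere.
  new_locale <- if (.Platform$OS.type == "windows") {
    "Czech_Czech Republic.1250"
  } else {
    "en_US.UTF-8"
  }
  # Sys.setlocale() returns "" (with a warning) when the request fails.
  if (identical(suppressWarnings(Sys.setlocale("LC_CTYPE", new_locale)), "")) {
    warning("could not set LC_CTYPE to ", new_locale)
  }

  # Convert to a matrix first, as in the workaround above.
  write.table(as.matrix(x), path, sep = "\t", fileEncoding = "UTF-8", ...)
  invisible(path)
}
```

It would be called as write_utf8_table(df, "test-czech-utf8.txt"). Using on.exit() instead of a manual reset means an error in write.table does not leave the session stuck in the temporary locale.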
(9) Try to read the text file back into R. NOTE: When I read the file, I had to set the encoding parameter (NOT fileEncoding!). The display of a data.frame read from the file is still incorrect, but when I convert my data.frame to a matrix the Czech UTF-8 characters are preserved:
data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
#the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
name id
1 Korycany nad prehradou 1
#see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
name id
1 "Koryčany nad přehradou" "1"
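As far as I can tell from the read.table documentation, the difference is that encoding only marks the strings it reads as UTF-8 without converting them, while fileEncoding re-encodes the input into the native encoding, which cannot represent "č" and "ř" in an English Windows locale. The round trip can be checked without relying on how the console prints. This is a minimal sketch, assuming a temporary file and a hard-coded `site_name` as stand-ins for the real data.

```r
# Stand-in for the downloaded siteName (an assumption for illustration).
site_name <- "Kory\u010dany nad p\u0159ehradou"
tmp <- tempfile(fileext = ".txt")

# Write raw UTF-8 bytes so the sample file does not depend on the locale.
con <- file(tmp, open = "wb")
writeLines(enc2utf8(c("name\tid", paste0(site_name, "\t1"))), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings as UTF-8 instead of re-encoding them.
from_file <- read.table(tmp, sep = "\t", header = TRUE, encoding = "UTF-8",
                        stringsAsFactors = FALSE)

Encoding(from_file$name)              # encoding mark after reading
identical(from_file$name, site_name)  # TRUE if the characters survived
```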
So the lesson learnt is that I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before I write my data with Czech characters to a file. Then, when I write the file, I must make sure fileEncoding is set to UTF-8. On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I must set encoding="UTF-8".
If anybody has a better solution, I'll welcome your suggestions.