
I am using R 3.2.0 with RStudio 0.98.1103 on Windows 7 64-bit. The Windows "regional and language settings" of my computer are set to English (United States).

For some reason, the following code replaces the Czech characters "č" and "ř" with "c" and "r" in the text "Koryčany nad přehradou" when I read an XML file in UTF-8 encoding from the web, parse the XML file to a list, and convert the list to a data.frame.

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

#this still displays correctly "Koryčany nad přehradou"
print(siteName) 

#make a data.frame from the list item. I suspect the problem is here.
df <- data.frame(name=siteName, id=1)

#now the Czech characters are lost. I see only "Korycany nad prehradou"
View(df) 

write.csv(df,"test.csv")
#the test.csv file also contains "Korycany nad prehradou" 
#instead of "Koryčany nad přehradou"

What is the problem? How do I make R show my data.frame correctly with all the UTF-8 special characters, and how do I save the .csv file without losing the Czech characters "č" and "ř"?

jirikadlec2
  • Can you change your locale to CZ and fix it that way? – thelatemail Apr 30 '15 at 02:14
  • a partial workaround for me was: `Sys.setlocale(locale="Czech")` to display the data.frame correctly as `Koryčany nad přehradou` in R. But now when I use `write.csv(df, "test.csv")` and open the test.csv in Excel or Notepad, the text shows up as `Koryèany nad pøehradou` in the csv file. The only way to solve the problem was to open the csv file in Notepad++ and change the encoding of the file to Windows-1250. – jirikadlec2 Apr 30 '15 at 02:23
  • I'm not convinced this is an exact duplicate. The other question seems to be focusing on a display issue, whereas this one seems to actually be changing the storage of the data - the previously suggested duplicate: http://stackoverflow.com/questions/17715956/why-do-some-unicode-characters-display-in-matrices-but-not-data-frames-in-r?lq=1 – thelatemail Apr 30 '15 at 03:04
  • @thelatemail The problem also appears with `View(df)`, as stated by the OP. –  Apr 30 '15 at 03:14
  • 3
    `write.csv` is just a wrapper for `write.table`, which has an `encoding` paremeter which defaults to `getOption("encoding")`. Have you tried changing that parameter? see `Encoding` section of `?file` for details. – Jthorpe Apr 30 '15 at 04:03
  • 1
    i think this is an excel issue. Do not open the `csv` file but open an excel then `Data -> From Text -> Delimited` and in the `file origin` select something the can read Czech characters. The other workaround should you change your system settings (in windows) to czech. Let me know how it goes. – dimitris_ps Apr 30 '15 at 08:06
  • 1
    The following workaround solved my problem on English Windows 7: (1) before making the data.frame run `Sys.setlocale(locale="Czech")` (2) now make the data.frame `df <- data.frame(name=siteName, id=1)` (3) View(df) now displays correctly (4) use fileEncoding="UTF-8" in write.table `write.table(df, "test-czech-utf-8.txt", sep="\t", fileEncoding="UTF-8")` and now the file shows up correctly when I open it in notepad and in Excel. (4) set the R locale back to the original `Sys.setlocale(locale="English")` – jirikadlec2 Apr 30 '15 at 16:37
  • @jirikadlec2 - well done - can you post that as an answer to your own question and accept it so that future searchers with the same problem can see that this is the case? – thelatemail Apr 30 '15 at 22:21
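
A minimal sketch of the fileEncoding idea from the comments above (the file name here is just an example). write.csv forwards extra arguments to write.table, so the output encoding can be requested directly; whether this alone is enough, or whether the locale switch from the workaround below is also needed, would have to be tested:

#hedged sketch: ask write.csv (via write.table) to write the file as UTF-8
write.csv(df, "test-utf8.csv", fileEncoding="UTF-8")
#Excel may still not detect UTF-8 automatically; importing via Data -> From Text,
#as suggested in the comments, may be needed on top of this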

1 Answer


This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and to construct the example so that my R script produces the same results on both the Windows and Linux platforms:

(1) Get XML data in UTF-8 from the Internet

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

(2) Print the text from the Internet. The encoding is UTF-8, and the display in the R console is also correct with both the Czech and the English locale on Windows:

> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
> 

(3) Try to create and view a data.frame. This is where the problem appears: the data.frame displays incorrectly both in the RStudio viewer and in the console:

df <- data.frame(name=siteName, id=1)
df
                    name id
1 Korycany nad prehradou  1
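
The string itself does not appear to be corrupted; only the data.frame display transliterates it. A small diagnostic sketch (stringsAsFactors=FALSE merely avoids the default factor conversion in R 3.2.0; it does not fix the display by itself):

df2 <- data.frame(name=siteName, id=1, stringsAsFactors=FALSE)
Encoding(df2$name)   #still reports "UTF-8", so the underlying data is intact
print(df2$name)      #prints "Koryčany nad přehradou" correctly, as in step (2)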

(4) Try using a matrix instead. Surprisingly, the matrix displays correctly in the R console (although not in the RStudio viewer).

m <- as.matrix(df)
View(m)  #this shows incorrectly in RStudio
m        #however, this shows correctly in the R console.
     name                     id 
[1,] "Koryčany nad přehradou" "1"

(5) Change the locale: on Windows, set the locale to Czech; on Unix or Mac, set it to UTF-8. NOTE: This has some problems when I run the script in RStudio; apparently RStudio doesn't always react immediately to the Sys.setlocale command.

#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")

#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale) 

(6) Write the data to a text file. IMPORTANT: don't use write.csv; use write.table instead. When my locale is Czech on my English Windows, I must use fileEncoding="UTF-8" in write.table. Now the text file shows up correctly in Notepad++ and also in Excel.

write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
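
As noted in the comments, write.csv is just a wrapper around write.table and also accepts fileEncoding, so if a .csv file is specifically needed, the following should behave the same under the current (Czech) locale; the file name is only an example:

#hedged alternative: CSV output with the encoding forced to UTF-8
write.csv(m, "test-czech-utf8.csv", fileEncoding="UTF-8")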

(7) Set the locale back to the original:

Sys.setlocale("LC_CTYPE", original.locale)

(8) Try to read the text file back into R. NOTE: When reading the file, I have to set the encoding parameter (NOT fileEncoding!). The display of the data.frame read from the file is still incorrect, but when I convert the data.frame to a matrix, the Czech UTF-8 characters are preserved:

data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
#the data.frame still has the display problem, "č" and "ř" get "lost"
> data.from.file
                     name id
1 Korycany nad prehradou  1

#see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
> matrix.from.file
  name                     id 
1 "Koryčany nad přehradou" "1"

So the lesson learned is that I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before I write my data with Czech characters to a file. When I write the file, I must make sure fileEncoding is set to "UTF-8". On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I must set encoding="UTF-8".
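
Putting the pieces together, here is a sketch of a small helper that wraps the locale switch and the UTF-8 write and restores the original locale even if the write fails. The function name write_utf8_table is my own invention; adjust the Czech locale string for your system if needed:

#hedged sketch: switch locale, write as UTF-8, and restore the locale afterwards
write_utf8_table <- function(x, file, sep="\t") {
  original.locale <- Sys.getlocale(category="LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", original.locale))
  new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
  Sys.setlocale("LC_CTYPE", new.locale)
  #convert to a matrix first, as above, so the characters are not transliterated
  write.table(as.matrix(x), file, sep=sep, fileEncoding="UTF-8")
}

write_utf8_table(df, "test-czech-utf8.txt")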

If anybody has a better solution, I'll welcome your suggestions.

thelatemail
jirikadlec2