
I am reading data from a MySQL database through RJDBC, and R correctly displays all letters (e.g., נווה שאנן). However, even when exporting with write.csv and fileEncoding="UTF-8", the output looks like <U+0436>.<U+043A>. <U+041B><U+043E><U+0437><U+0435><U+043D><U+0435><U+0446> (in this case not the string above but a Bulgarian one) for Bulgarian, Hebrew, Chinese, and so on. Other special characters like ã, ç, etc. work fine.

I suspect this is because of the UTF-8 BOM, but I have not found a solution on the net.

My OS is a German Windows 7.

edit: I tried

con <- file("file.csv", encoding = "UTF-8")
write.csv(x, con, row.names = FALSE)

and the (afaik) equivalent write.csv(x, file = "file.csv", fileEncoding = "UTF-8", row.names = FALSE).
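For reproducibility, here is a minimal self-contained sketch of both attempts, with a hypothetical sample vector standing in for the database query:

# Hypothetical sample data in place of the RJDBC query result
x <- data.frame(v = c("נווה שאנן", "ж.к. Лозенец"), stringsAsFactors = FALSE)

# Attempt 1: explicit connection with a declared encoding
con <- file("file.csv", encoding = "UTF-8")
write.csv(x, con, row.names = FALSE)

# Attempt 2: the (afaik) equivalent fileEncoding argument
write.csv(x, file = "file.csv", fileEncoding = "UTF-8", row.names = FALSE)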

Arthur G
  • Are you saying that when you open the exported file, you see "U+0436" instead of "ж"? If so, that's not a BOM issue, just an issue of the Unicode code points not being encoded into a UTF encoding but output as code points. Maybe show us some code for how exactly you're exporting the file? – deceze Sep 13 '11 at 13:29
  • I added information on how I exported the file. And yes, I see "<U+0436>" instead of "ж" – Arthur G Sep 14 '11 at 08:29
  • Seeing "<U+0436>" in the file is ambiguous (it could even mean that those characters are actually inlined in the file, or that your editor just cannot display them). You could either write the "ж" to a file and tell us the hex values of all the characters the generated file contains (open it in a hex editor), or give us the code to reproduce your problem (of course we don't have your DB, so create a vector with the sample data). – Bernd Elkemann Sep 14 '11 at 10:19
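As an aside, the hex inspection suggested above can also be done from within R; a minimal sketch using only base functions, assuming the exported file is file.csv:

# Read the first bytes of the generated file and show them as hex;
# a UTF-8 BOM would appear as ef bb bf at the start.
bytes <- readBin("file.csv", what = "raw", n = 32)
print(bytes)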

2 Answers


The accepted answer did not help me in a similar application (R 3.1 on Windows, while I was trying to open the file in Excel). Anyway, based on this part of the file() documentation:

If a BOM is required (it is not recommended) when writing it should be written explicitly, e.g. by writeChar("\ufeff", con, eos = NULL) or writeBin(as.raw(c(0xef, 0xbb, 0xbf)), binary_con)

I came up with the following workaround:

write.csv.utf8.BOM <- function(df, filename) {
    con <- file(filename, "w")
    tryCatch({
        # Re-encode every column to UTF-8 before writing
        for (i in 1:ncol(df))
            df[, i] <- iconv(df[, i], to = "UTF-8")
        # Write the BOM explicitly, then the CSV body
        writeChar(iconv("\ufeff", to = "UTF-8"), con, eos = NULL)
        write.csv(df, file = con)
    }, finally = {
        close(con)
    })
}

Note that df is the data.frame and filename is the path to the CSV file.
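For illustration, a hypothetical call (the sample data and output path are made up here):

# Hypothetical sample data with non-Latin text
df <- data.frame(name = c("ж.к. Лозенец", "נווה שאנן"),
                 value = 1:2,
                 stringsAsFactors = FALSE)

write.csv.utf8.BOM(df, "out.csv")  # Excel should now detect UTF-8 via the BOM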

rmojab63
  • This is great. This should be the accepted answer (Windows 7, R version 3.4.2) – TaylorV Jun 20 '18 at 16:10
  • Still going fine on R 3.5.3. Just two small remarks: instead of the `tryCatch()` construct you could just use `on.exit(close(con))`. It might also be useful to pass `fileEncoding = "utf-8"` to `write.csv()` for best results. – Stefan F Apr 30 '19 at 11:40
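A sketch of the variant that comment suggests (the name write.csv.utf8.BOM2 is hypothetical; note that write.csv() applies fileEncoding only to file paths, not to already-open connections, so the encoding is declared on the connection here instead):

# on.exit() guarantees the connection is closed even if an error occurs
write.csv.utf8.BOM2 <- function(df, filename) {
    con <- file(filename, open = "w", encoding = "UTF-8")
    on.exit(close(con))
    writeChar("\ufeff", con, eos = NULL)  # write the BOM first
    write.csv(df, file = con)             # then the CSV body
}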

On the help page for Encoding (help("Encoding")) you can read about a special encoding: bytes.

Using this, I was able to generate a CSV file:

v <- "נווה שאנן"
X <- data.frame(v1 = rep(v, 3), v2 = LETTERS[1:3], v3 = 0,
                stringsAsFactors = FALSE)

# Declare the column's encoding as raw bytes, so write.csv passes the
# UTF-8 bytes through unchanged instead of re-encoding them
Encoding(X$v1) <- "bytes"
write.csv(X, "test.csv", row.names = FALSE)

Take care with the differences between factor and character columns. The following should work:

# Switch every UTF-8 character column to the "bytes" encoding
id_characters <- which(sapply(X,
    function(x) is.character(x) && Encoding(x) == "UTF-8"))
for (i in id_characters) Encoding(X[[i]]) <- "bytes"

# Do the same for the levels of UTF-8 factor columns
id_factors <- which(sapply(X,
    function(x) is.factor(x) && Encoding(levels(x)) == "UTF-8"))
for (i in id_factors) Encoding(levels(X[[i]])) <- "bytes"

write.csv(X, "test.csv", row.names = FALSE)
Marek