I'm confused about why certain characters (e.g. "Ě", "Č", and "ŝ") lose their diacritical marks in a data frame, while others (e.g. "Š" and "š") do not. My OS is Windows 10.

In my sample code below, a vector czechvec holds 11 single-character strings, all accented Latin characters. R displays those characters properly. Then a data frame mydf is created with czechvec as its second column (wrapped in I() so it isn't converted to a factor). But when R displays mydf, or any full row of mydf, it reduces most of these characters to their plain-ASCII equivalents; e.g. mydf[3,] shows the character as "E", not "Ě". Yet when I subscript by both row and column, e.g. mydf[3,2], R properly shows the accented character ("Ě").

Why should it make a difference whether R displays a whole row or a single cell? And why are some characters, like "Š", completely unaffected? Also, when I write this data frame to a file, the accents are lost entirely, even though I specify fileEncoding="UTF-8".
> charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
> hexvals <- as.hexmode(charvals)
> czechvec <- unlist(strsplit(intToUtf8(charvals), ""))
> czechvec
[1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"
>
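For what it's worth, the strings should be marked as UTF-8 internally, since intToUtf8() always returns a UTF-8 string and strsplit() preserves that marking; a quick check (not part of the transcript above) would be:

Encoding(czechvec)  # I'd expect "UTF-8" for each of the 11 elements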
> mydf = data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
> mydf
dec char hex
1 193 Á 00C1
2 269 c 010D
3 282 E 011A
4 268 C 010C
5 262 C 0106
6 263 c 0107
7 348 S 015C
8 349 s 015D
9 350 S 015E
10 352 Š 0160
11 353 š 0161
> mydf[3,2]
[1] "Ě"
> mydf[3,]
dec char hex
3 282 E 011A
>
> write.table(mydf, file="myfile.txt", fileEncoding="UTF-8")
>
> df2 <- read.table("myfile.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")
> df2[3,2]
[1] "E"
Edited to add: Per Ernest A's answer, this behaviour is not reproducible on Linux, so it must be a Windows-specific issue. (I'm using R 3.4.1 for Windows.)
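Edited again to add: for anyone else hitting this, a workaround I'm considering (untested on my machine, adapted from general advice about writing UTF-8 files on Windows) is to open the connection myself and write the UTF-8 bytes directly with useBytes=TRUE, so nothing gets translated through the native codepage on the way out. This is only a minimal sketch to test whether the bytes can be written at all; "utf8test.txt" is just an illustrative file name.

# Sketch of a possible workaround (unverified): write raw UTF-8 bytes,
# bypassing the connection's re-encoding step.
con <- file("utf8test.txt", open = "w", encoding = "native.enc")
writeLines(enc2utf8(czechvec), con = con, useBytes = TRUE)
close(con)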