
I'm using R 2.15.0 on Windows 7 64-bit. I would like to output unicode (CJK) text to a file.

The following code shows that a Unicode character written with write() to a UTF-8 file connection does not come out as I expected:

rty <- file("test.txt",encoding="UTF-8")
write("在", file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
scan(rty,what=character())
close(rty)

As shown by the output of scan:

Read 1 item 
[1] "<U+5728>"

The file was not written with the UTF-8 encoded character itself, but with some kind of ASCII fallback (the escape string "<U+5728>"). Can I make it work right the first time (i.e. get a text file that contains "在" instead), or can I work some extra magic to convert the output to Unicode, with the proper character replacing the escape string?

Thanks.

[More info: the same code behaves properly in Cygwin, R 2.14.2, while 2.14.2 on Win7 is also broken. Is this on my end somewhere?]

Patrick
  • [Belated update] The issues tend to be with *locale* rather than encoding. I have resolved gibberish output issues by temporarily changing locale to something "appropriate." God help you if you have language data from more than one locale. – Patrick Sep 05 '13 at 12:21
  • 1
    maybe this [post](http://stackoverflow.com/questions/11069908/r-extracting-clean-utf-8-text-from-a-web-page-scraped-with-rcurl?lq=1) will help. – DJJ Mar 01 '15 at 08:35

5 Answers


The problem is due to some Windows-specific behaviour in R (it uses the default system encoding, or some system write functions; I do not know the specifics, but the behaviour is known).

To write UTF-8 encoded text on Windows, one has to use the useBytes=TRUE option of writeLines, and pass encoding="UTF-8" to readLines when reading the file back:

txt <- "在"
writeLines(txt, "test.txt", useBytes=TRUE)

readLines("test.txt", encoding="UTF-8")
# [1] "在"

For much more detail, see this really well-written article by Kevin Ushey: http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
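As a rough sketch of how one might check the result (this is not part of the original answer): the snippet below writes the character with useBytes=TRUE and then inspects the raw bytes on disk. The temporary file path, the enc2utf8() call, and the "\u5728" escape are assumptions added here so that the script does not depend on the encoding of the source file.

```r
# "\u5728" is the character "在"; enc2utf8() makes sure it is stored as UTF-8
txt <- enc2utf8("\u5728")
path <- tempfile(fileext = ".txt")  # hypothetical output path

# useBytes = TRUE writes the string's bytes as-is, without re-encoding
# them to the native (locale) encoding first
writeLines(txt, path, useBytes = TRUE)

# Inspect what actually landed on disk
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
print(bytes)  # the first three bytes should be e5 9c a8 (UTF-8 for U+5728),
              # followed by the platform's line ending

# Read the text back, declaring the file's encoding
back <- readLines(path, encoding = "UTF-8")
print(identical(back, txt))
```

If the first three bytes are anything other than e5 9c a8, the text was re-encoded somewhere on the way to the file.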

petermeissner

This saves UTF-8 strings to a text file:

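kLogFileName <- "parser.log"
# Note: this definition masks base::log() in the current environment
log <- function(msg="") {
  # Open the log file in append mode
  con <- file(kLogFileName, "a")
  tryCatch({
    # With no `from`, iconv() assumes msg is in the native (locale) encoding
    cat(iconv(msg, to="UTF-8"), file=con, sep="\n")
  },
  finally = {
    close(con)
  })
}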
beloblotskiy
  • Did this break in more recent R versions? When I write files this way, I still have to set the encoding parameter of readLines to "ANSI" to get the correct file content. An example is "à" coming out as "\xe0" under UTF-8 encoding, but correctly under ANSI encoding when using readLines of the file created – dimpol Nov 11 '16 at 11:03
  • @Curious - No, I ended up doing it manually using notepad++. I only needed to do it once for the files in one dataset, and it was faster just to bite the bullet and do it manually than to keep messing with R file encodings. – dimpol May 09 '17 at 09:38
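To illustrate the "\xe0" symptom from the comment above: 0xE0 is the single-byte Latin-1 (and Windows-1252) code for "à", so a file containing that byte was written in a single-byte code page rather than in UTF-8. A minimal sketch, assuming the input really is Latin-1, of converting it by naming the source encoding explicitly in iconv():

```r
# A single Latin-1 byte for "à" (assumed sample input; built from a raw
# value so the example does not depend on the encoding of this script)
x <- rawToChar(as.raw(0xe0))

# Name the source encoding explicitly instead of relying on the locale
y <- iconv(x, from = "latin1", to = "UTF-8")

print(charToRaw(x))  # [1] e0
print(charToRaw(y))  # [1] c3 a0
print(Encoding(y))   # [1] "UTF-8"
```

Whether "latin1" is the right source encoding depends on where the string came from; that part is a guess that has to be checked against the actual data.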

For anyone coming upon this question later, see the stringi package (https://cran.r-project.org/web/packages/stringi/index.html). It includes numerous functions to enable consistent, cross-platform UTF-8 string support in R. Most relevant to this thread, the stri_read_lines(), stri_read_raw(), and stri_write_lines() functions can consistently input/output UTF-8, even on Windows.
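A minimal sketch of a round trip with the functions named above, assuming stringi is installed. The temporary path is hypothetical, and the argument usage reflects my reading of the stringi documentation, so check ?stri_write_lines for your installed version.

```r
# Only run if stringi is available (install.packages("stringi") otherwise)
if (requireNamespace("stringi", quietly = TRUE)) {
  txt  <- "\u5728"                    # "在", written as a Unicode escape
  path <- tempfile(fileext = ".txt")  # hypothetical output path

  # Write the string as UTF-8, independent of the native locale encoding
  stringi::stri_write_lines(txt, path, encoding = "UTF-8")

  # Read it back, declaring the file's encoding explicitly
  back <- stringi::stri_read_lines(path, encoding = "UTF-8")

  print(identical(back[1], txt))
}
```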

sm925
Brenton Wiernik

I think you are having problems because write is constructed so that it takes the name of an object, and you do not appear to have built such a named object. Try this instead:

txt <- "在"
rty <- file("test.txt",encoding="UTF-8")
write(txt, file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
inp <- scan(rty,what=character())
# Read 1 item
close(rty)
inp
# [1] "在"
IRTFM
  • Hm, the original application that inspired the minimal snippet above used named objects. Moreover the code you provide above produces the same result for me as above. Perhaps I have a native encoding issue? – Patrick May 21 '12 at 02:39

I had the same problem with UTF-8 strings that come from a database.

The only way I've found to save them properly is to write the file in binary mode.

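  # file.name and the_utf8_str are assumed to be defined by the caller.
  # The connection is named `con` rather than `F`, which would mask the
  # built-in shorthand for FALSE.
  con <- file(file.name, "wb")
  tryCatch({
    # Write the string's raw bytes, bypassing any text-mode re-encoding
    writeBin(charToRaw(the_utf8_str), con)
  },
  finally = {
    close(con)
  })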
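For completeness, a self-contained sketch of the same idea that also reads the file back. The path, the sample string, and the enc2utf8() call are assumptions of mine; enc2utf8() makes sure the bytes handed to charToRaw() really are UTF-8, whatever the native encoding is.

```r
the_utf8_str <- enc2utf8("\u5728")       # "在"; sample data
file.name <- tempfile(fileext = ".txt")  # hypothetical output path

# Write the raw UTF-8 bytes in binary mode (no newline is appended)
con <- file(file.name, "wb")
tryCatch({
  writeBin(charToRaw(the_utf8_str), con)
}, finally = {
  close(con)
})

# Read the raw bytes back and declare them to be UTF-8
bytes <- readBin(file.name, what = "raw", n = file.info(file.name)$size)
back <- rawToChar(bytes)
Encoding(back) <- "UTF-8"

print(bytes)  # [1] e5 9c a8
print(identical(back, the_utf8_str))
```

Because nothing passes through a text-mode connection, neither the locale nor the connection's encoding argument gets a chance to re-encode the string.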
beloblotskiy