
I'm using R 3.1.1 on Windows 7 32-bit. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)

The problem is that I can't seem to read the file even when specifying that encoding. (The characters are from the standard Spanish Latin set (ñ, á, ó) and should be handled easily by CP1252 or anything like that.)

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
> readLines("filename.txt")
 [1] "ÿþE" ""    ""    ""    ""   ...
> readLines("filename.txt",encoding="UTF-8")
 [1] "\xff\xfeE" ""          ""          ""          ""    ...
> readLines("filename.txt",encoding="UCS2LE")
 [1] "ÿþE" ""    ""    ""    ""    ""    ""     ...
> readLines("filename.txt",encoding="UCS2")
 [1] "ÿþE" ""    ""    ""    ""    ...

Any ideas?

Thanks!!


edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly

s_a
    `'\xff\xfe'` is the `UTF-16LE` encoding of the byte order mark (BOM) character. Decoding with UTF-8 should fail as FFh is an invalid start byte, but I'm not familiar with R. – Mark Tolonen Oct 11 '14 at 03:57
    I've had similar struggles with encoding. Had more success with `scan` than I did `readLines`. Try `scan("filename.txt", fileEncoding="UCS-2LE", sep="\n")` – Paul Regular Oct 11 '14 at 13:47
  • Thanks for answering. I think I should report this as a bug, right? `scan` does read the file (though I don't understand the difference between the `fileEncoding` and `encoding` params), but it creates other problems. First, it only accepts single-byte separators, and if you pass an unusable separator it falls back to space. It also strips the \r\n that I need to preserve. And finally, for some reason `paste` fails to concatenate the strings (it just returns the original vector). – s_a Oct 14 '14 at 13:01
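The `scan` approach suggested in the comments above can be sketched roughly as follows. `read_with_scan` is a hypothetical helper name; `what = "character"` is needed because `scan` expects numeric fields by default, and `paste(..., collapse = "\r\n")` restores the CRLF line endings that `scan` strips (plain `paste` without `collapse` returns the vector unchanged, which may explain the `paste` problem mentioned above):

```r
# Sketch of reading a UTF-16/UCS-2 text file via scan(), with the line
# endings re-added afterwards. read_with_scan is a hypothetical helper.
read_with_scan <- function(path, enc = "UCS-2LE") {
  lines <- scan(path, what = "character", sep = "\n",
                fileEncoding = enc, blank.lines.skip = FALSE, quiet = TRUE)
  # collapse = "\r\n" joins the vector into one string with CRLF terminators;
  # paste() without collapse would just return the vector element-wise
  paste(lines, collapse = "\r\n")
}
```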

1 Answer


After reading the documentation more closely, I found the answer to my question.

The `encoding` parameter of `readLines` only marks input strings as being in a known encoding; it does not re-encode them. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also ‘Details’.

The proper way to read a file with an uncommon encoding, then, is to specify the encoding on the connection:

con <- file("UnicodeFile.txt", encoding = "UCS-2LE")
filetext <- readLines(con)
close(con)
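A slightly more defensive variant (a sketch, not part of the original answer): peek at the first two bytes for a UTF-16 byte order mark (`FF FE` = little endian, `FE FF` = big endian) and open the connection with the matching encoding. `read_utf16` is a hypothetical helper name, and the UTF-8 fallback is an assumption you may want to change for your data:

```r
# Detect a UTF-16 BOM and read the file with the matching encoding.
# read_utf16 is a hypothetical helper; the no-BOM fallback is an assumption.
read_utf16 <- function(path) {
  bom <- readBin(path, what = "raw", n = 2)
  enc <- if (identical(bom, as.raw(c(0xff, 0xfe)))) "UTF-16LE"
         else if (identical(bom, as.raw(c(0xfe, 0xff)))) "UTF-16BE"
         else "UTF-8"  # no BOM found: assume UTF-8
  con <- file(path, encoding = enc)
  on.exit(close(con))  # close the connection even if readLines() errors
  txt <- readLines(con)
  sub("^\ufeff", "", txt)  # drop the decoded BOM from the first line
}
```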
s_a
  • 3
    Thanks this worked for me. I used: `hht9aa <- read.csv(file("hht9aa_aa.txt",encoding="UCS-2LE"))` And finally got it to read UTF-16 Little Endian files correctly. But I did not have to close(con), in fact I got an error when I did so, and eventually left it out. – Mike Wise Apr 14 '15 at 11:53