10

I wish to convert an HTML file encoded in ANSI to UTF-8, using R.

Is there a tool, or a combination of tools, that can make this work?

Thanks.

Edit: OK, I've narrowed my problem down to a different one. It is re-posted here: Using "cat" to write non-English characters into a .html file (in R)

Tal Galili

2 Answers

22

You can use `iconv`:

writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF8"), "tmp2.html")

tmp2.html should then be encoded in UTF-8.


Edit by Henrik in June 2015:
A working solution for Windows distilled from the comments is as follows:

writeLines(iconv(readLines("tmp.html"), from = "ANSI_X3.4-1986", to = "UTF8"), 
           file("tmp2.html", encoding="UTF-8"))

Update 2021: If ANSI is the encoding of the current locale, the following works as well (`from = ""` tells `iconv` to take the current locale's encoding as the source encoding):

writeLines(iconv(readLines("tmp.html"), from = "", to = "UTF8"), 
           file("tmp2.html", encoding="UTF-8"))
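To check a conversion like the ones above byte by byte, the following is a minimal, self-contained sketch. It assumes the source is CP1252 (as suggested in the comments), uses temporary file names that are purely illustrative, and writes through a binary connection with `useBytes = TRUE` instead of `file(..., encoding="UTF-8")`; that is a swapped-in variant based on the `useBytes` suggestion in the comments, not the method of the answer itself.

```r
# Illustrative temporary files; replace with your own paths.
src <- file.path(tempdir(), "tmp.html")
dst <- file.path(tempdir(), "tmp2.html")

# Build a tiny CP1252 sample: byte 0xE9 is e-acute in CP1252.
writeBin(c(charToRaw("<p>caf"), as.raw(0xE9), charToRaw("</p>\n")), src)

# Read the lines as unmarked bytes, then convert them explicitly.
lines_in  <- readLines(src, warn = FALSE)
lines_out <- iconv(lines_in, from = "CP1252", to = "UTF-8")

# useBytes = TRUE writes the UTF-8 bytes unchanged instead of
# re-encoding them to the native locale on output.
con <- file(dst, open = "wb")
writeLines(lines_out, con, useBytes = TRUE)
close(con)

# Inspect the raw bytes: e-acute is now the two bytes c3 a9.
readBin(dst, what = "raw", n = file.size(dst))
```

Looking at the raw bytes avoids relying on how an editor guesses the encoding of a file that happens to contain only ASCII characters.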
Henrik
kohske
  • But what about the HTML headers? Shouldn't they be changed as well? – Marek Sep 20 '11 at 08:58
  • Thanks Kohske, but this doesn't work for me. It converts the text in the file in some weird way, but not the file itself. When I use Notepad++ to look at the encoding, it is still ANSI, and only through Notepad++ can I change it to UTF-8 (your code won't do it). Any suggestions? :) – Tal Galili Sep 20 '11 at 09:04
  • 3
    How about changing `from = "CP1252"` ? – kohske Sep 20 '11 at 09:28
  • Kohske - this is indeed the correct encoding to use. But when I read the file into R, it interprets the text correctly. I'll try to update my question to better explain... – Tal Galili Sep 20 '11 at 10:11
  • @TalGalili You need to define a `file` connection with the proper encoding (see `?file`). Something like `f<-file("tmp2.html", encoding="UTF-8")` and then `writeLines(....., f)`. – Marek Sep 20 '11 at 10:22
  • Thanks Marek. This looks to be in the right direction, but no success yet. Please continue this on the new thread I started (which has an updated question): http://stackoverflow.com/questions/7483742/using-cat-to-write-non-english-characters-into-a-html-file-in-r – Tal Galili Sep 20 '11 at 10:38
  • 1
    What does your test html file contain? From `?Encoding`: "ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings." Also try `useBytes = TRUE` in the call to `writeLines`. – Richie Cotton Sep 20 '11 at 10:39
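
On the HTML header question raised in the comments: converting the bytes does not change any `charset` declaration inside the file, so a `<meta>` tag may still announce the old encoding. A rough sketch of updating it with a regular expression follows; the sample lines are made up for illustration, and the pattern only assumes the declaration has the form `charset=...`.

```r
# Made-up input lines, as they might come out of readLines().
html <- c(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">",
  "</head><body><p>text</p></body></html>"
)

# Keep "charset=" (plus an optional opening quote) and replace the value
# that follows, up to the next quote, space, or ">", with UTF-8.
fixed <- sub("(charset=[\"']?)[^\"' >]+", "\\1UTF-8", html, ignore.case = TRUE)

fixed[2]
```

Lines without a `charset=` declaration pass through unchanged, so the substitution can be applied to the whole vector before writing it out.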
0

I had some problems with the solutions proposed above, especially with the TAB character. This alternative never disappointed me. Unfortunately it only works on UNIX-like systems.

system('iconv -f CP1252 -t UTF-8 < tmp.html > tmp2.html')
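
A slightly more defensive way to make the same call from R might look like the sketch below. `convert_with_iconv()` is a hypothetical helper name, not a function from any package, and the CP1252 default is an assumption carried over from the command above. It checks that an `iconv` executable is on the PATH and reports whether the command exited successfully.

```r
# convert_with_iconv() is a hypothetical helper, not part of any package.
convert_with_iconv <- function(src, dst, from = "CP1252", to = "UTF-8") {
  if (!nzchar(Sys.which("iconv"))) {
    warning("no iconv executable found on the PATH")
    return(FALSE)
  }
  # system2() assembles a shell command line on Unix-alikes,
  # so quote the input path in case it contains spaces.
  status <- system2("iconv",
                    args = c("-f", from, "-t", to, shQuote(src)),
                    stdout = dst)
  status == 0
}

# Usage: convert_with_iconv("tmp.html", "tmp2.html")
```

Passing the output file through the `stdout` argument replaces the shell redirection used in the one-liner.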
ExaFusion