I'm trying out the "new" rvest package from Hadley Wickham.

I've used it in the past, so I expected everything to run smoothly.

However, I keep seeing this error:

> TV_Audio_Video_Marca <- read_html(page_source[[1]], encoding = "ISO-8859-1")
Error: Input is not proper UTF-8, indicate encoding !
Bytes: 0xCD 0x20 0x53 0x2E [9]

As you can see in the code, I've used encoding = "ISO-8859-1". Before that I was using "UTF-8", but guess_encoding(page_source[[1]]) says that the encoding is ISO-8859-1. I've tried every option suggested by guess_encoding, but none of them worked.

What is the problem?

My code:

library(RSelenium)
library(rvest)
#start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

#navigate to your page
remDr$navigate("http://www.linio.com.pe/tv-audio-y-video/televisores/")

#scroll down 5 times, waiting for the page to load at each time
for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

#get the page html
page_source<-remDr$getPageSource()

#parse it

TV_Audio_Video_Marca <- read_html(page_source[[1]], encoding = "UTF-16LE")

UPDATE 1

I've googled for "How to know the encoding of a web page?".

I found this Markup Validation Tool from W3C, but it wasn't of much help:

http://validator.w3.org/check?uri=http://www.w3.org/2003/10/empty/emptydoc.html

Omar Gonzales
  • Try: `TV_Audio_Video_Marca <- read_html(iconv(page_source[[1]], to="UTF-8"), encoding = "utf8")` – jeremycg Sep 29 '15 at 01:42
  • That works. Please post a complete answer with an explanation of iconv() and its use in this case. The documentation did not mention this "trick", or does it? – Omar Gonzales Sep 29 '15 at 01:52

1 Answer

Looking at the page source, they claim to be using UTF-8 encoding:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

So the question is whether they are really using an encoding different enough that we need to worry about it, or whether we can just convert to UTF-8 and assume that any errors will be negligible.

If you are happy with a quick-and-dirty approach and can tolerate some potential mojibake, you can just force UTF-8 using iconv:

TV_Audio_Video_Marca <- read_html(iconv(page_source[[1]], to = "UTF-8"), encoding = "utf8")

In general, this is a bad idea: it is better to specify the encoding you are converting from. In this case the error may be on their side, so the quick-and-dirty approach might be OK.
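As for why the iconv() call helps: its `from` argument defaults to "", which means the current locale's encoding, so the one-argument form converts from whatever your session uses natively. The sketch below is only an illustration, using the four bytes quoted in the error message rather than the real page; the variable names are made up for the example, and it assumes those bytes are ISO-8859-1 text.

```r
# The four bytes reported in the error message: 0xCD 0x20 0x53 0x2E.
# Assumption for illustration: they are ISO-8859-1 text, where 0xCD is "Í".
raw_bytes <- as.raw(c(0xCD, 0x20, 0x53, 0x2E))
latin1_text <- rawToChar(raw_bytes)

# Convert with an explicit source encoding instead of relying on the
# locale-dependent default (from = "").
utf8_text <- iconv(latin1_text, from = "ISO-8859-1", to = "UTF-8")

# "Í" becomes the two-byte UTF-8 sequence c3 8d; the ASCII bytes are unchanged.
print(charToRaw(utf8_text))  # c3 8d 20 53 2e
```

In UTF-8, 0xCD can only start a two-byte sequence and must be followed by a continuation byte, which 0x20 is not; that is why the parser rejects the input.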

jeremycg
  • Is there a tool to validate encoding? As you say, they claim to use "utf-8", but rvest, for example, does not recognize it. I see the page in Spanish and everything looks OK. But how can I tell whether they are really using "utf-8"? – Omar Gonzales Sep 29 '15 at 02:14
  • 1
    you did the right thing, as far as I can tell the error is on the websites side. If you do want to be sure you have it correct, `read_html(iconv(page_source[[1]], from = "ISO-8859-1", to = "UTF-8"), encoding = "utf8")` is explicit – jeremycg Sep 29 '15 at 04:17
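
On the question of validating the encoding from within R: one possible check is base R's validUTF8() (available since R 3.3.0), which reports whether a string's bytes form valid UTF-8. The helper below is a hypothetical sketch, not part of rvest or RSelenium; the name `ensure_utf8` and the ISO-8859-1 fallback are assumptions made for this example.

```r
# Hypothetical helper: keep the string if its bytes are already valid UTF-8,
# otherwise convert from an assumed fallback encoding.
ensure_utf8 <- function(html, fallback = "ISO-8859-1") {
  if (validUTF8(html)) {
    Encoding(html) <- "UTF-8"  # only marks the string; the bytes are untouched
    return(html)
  }
  iconv(html, from = fallback, to = "UTF-8")
}

# A string containing a lone 0xCD byte is not valid UTF-8, so it is converted.
bad <- rawToChar(as.raw(c(0xCD, 0x20, 0x53, 0x2E)))
fixed <- ensure_utf8(bad)
print(validUTF8(bad))    # FALSE
print(validUTF8(fixed))  # TRUE
```

With something like this, the parse step could become `read_html(ensure_utf8(page_source[[1]]), encoding = "UTF-8")`. Note that validUTF8() only confirms the bytes are well formed; it cannot tell you which encoding the site actually intended.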