
I am trying to web scrape a page. I thought of using the package rvest. However, I'm stuck at the first step, which is to use read_html to read the content. Here's my code:

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
                        encoding = "ISO-8895-1")

And I got the following error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]

I tried using what similar questions had as answers, but it did not solve my issue:

obra_caridade <- read_html(iconv(url, to = "UTF-8"),
                        encoding = "UTF-8")

obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
                        encoding = "ISO-8895-1")

Both attempts returned a similar error. Does anyone have any suggestion about how to solve this issue? Here's my session info:

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_0.3.2 xml2_1.1.1 

loaded via a namespace (and not attached):
[1] httr_1.2.1   magrittr_1.5 R6_2.2.1     tools_3.3.1  curl_2.6     Rcpp_0.12.11
MrFlick
Manoel Galdino

1 Answer
What's the issue?

Your issue here is in correctly determining the encoding of the webpage.

The good news
Your approach looks like a good one to me, since you looked at the page source and found the Meta charset, given as ISO-8895-1. It is certainly preferable to be told the encoding rather than to resort to guesswork.

The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
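You can verify this directly (a quick sketch; the exact contents of iconvlist() depend on your platform's iconv build, though ISO-8859-1 is available on virtually all of them):

```r
# Check whether the page's declared charset is a supported encoding name
"ISO-8895-1" %in% iconvlist()  # FALSE - no such encoding exists
"ISO-8859-1" %in% iconvlist()  # TRUE on typical installations
```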

Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
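In code, the quick fix looks like this (a sketch, assuming the page is still served with the same mislabelled charset):

```r
library(rvest)

url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# "latin1" is R's alias for ISO-8859-1, the likely intended encoding
obra_caridade <- read_html(url, encoding = "latin1")

# Accented characters should now come through correctly, e.g.:
# html_text(html_nodes(obra_caridade, "dt"))
```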


Tips on guessing an encoding

What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).

Guessing with R

  1. rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
    One thing to note about this function is that it can only distinguish among roughly 30 of the more common encodings, but there are many more possibilities.

  2. As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
    Sample code can be found at the bottom of this answer.
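Point 1 can be sketched as follows (guess_encoding returns a data frame of candidate encodings ranked by confidence; the exact ranking depends on the text you feed it):

```r
library(rvest)

# Estimate the encoding from a sample phrase containing accented characters
guess_encoding("Situação do Termo/Convênio:")
# Returns candidate encodings with confidence scores - inspect the top rows
```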

Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.

The page URL has a .br domain, indicating it's Brazilian, and - according to Wikipedia - this encoding has complete language coverage for Brazilian Portuguese, which suggests it would not be a strange choice for whoever created the webpage. I believe it is also a reasonably common encoding type.


Code

Sample code for 'Guessing with R' point 2 (using iconvlist()):

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {

  # Optional progress indicator (fraction complete), since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))

  read_attempt <- tryCatch(expr=read_html(url, encoding=encoding_attempt),
                           error=function(condition) NA,
                           warning=function(condition) message(condition))
  return(read_attempt)
})

names(read_page) <- unique(iconvlist())

# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  if (inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))

# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[read_phrase == "Situação:"]
hodgenovice