
I am trying to web scrape a page. I thought of using the package rvest. However, I'm stuck at the first step, which is to use read_html to read the content. Here's my code:

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
                        encoding = "ISO-8895-1")

And I got the following error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]

I tried using what similar questions had as answers, but it did not solve my issue:

obra_caridade <- read_html(iconv(url, to = "UTF-8"),
                        encoding = "UTF-8")

obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
                        encoding = "ISO-8895-1")

Both attempts returned a similar error. Does anyone have any suggestion about how to solve this issue? Here's my session info:

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_0.3.2 xml2_1.1.1 

loaded via a namespace (and not attached):
[1] httr_1.2.1   magrittr_1.5 R6_2.2.1     tools_3.3.1  curl_2.6     Rcpp_0.12.11
MrFlick
Manoel Galdino

1 Answer
What's the issue?

Your issue here is in correctly determining the encoding of the webpage.

The good news
Your approach looks like a good one to me, since you looked at the page source and found the Meta charset, given as ISO-8895-1. It is certainly preferable to be told the encoding rather than to resort to guesswork.

The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
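You can verify this directly (a quick sketch; the exact contents of iconvlist() depend on your platform's iconv build, though ISO-8859-1 is available on virtually all of them):

```r
# Check whether the page's declared charset is a supported encoding name
"ISO-8895-1" %in% iconvlist()  # FALSE - no such encoding exists
"ISO-8859-1" %in% iconvlist()  # TRUE on typical installations
```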

Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
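In code, the quick fix looks like this (a sketch, assuming the page is still served with the same mislabelled charset):

```r
library(rvest)

url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# "latin1" is R's alias for ISO-8859-1, the likely intended encoding
obra_caridade <- read_html(url, encoding = "latin1")

# Accented characters should now come through correctly, e.g.:
# html_text(html_nodes(obra_caridade, "dt"))
```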


Tips on guessing an encoding

What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).

Guessing with R

  1. rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
    One thing to note about this function is that it can only distinguish among roughly 30 of the more common encodings, but there are many more possibilities.

  2. As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
    Sample code can be found at the bottom of this answer.
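Point 1 can be sketched as follows (guess_encoding returns a data frame of candidate encodings ranked by confidence; the exact ranking depends on the text you feed it):

```r
library(rvest)

# Estimate the encoding from a sample phrase containing accented characters
guess_encoding("Situação do Termo/Convênio:")
# Returns candidate encodings with confidence scores - inspect the top rows
```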

Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.

The page URL has a .br domain, indicating it's Brazilian, and - according to Wikipedia - this encoding has complete language coverage for Brazilian Portuguese, which suggests it would not be a strange choice for whoever created the webpage. I believe it is also a reasonably common encoding type.


Code

Sample code for 'Guessing with R' point 2 (using iconvlist()):

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {

  # Optional progress indicator (fraction complete), since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))

  read_attempt <- tryCatch(expr=read_html(url, encoding=encoding_attempt),
                           error=function(condition) NA,
                           warning=function(condition) message(condition))
  return(read_attempt)
})

names(read_page) <- unique(iconvlist())

# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  if (inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))

# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[read_phrase == "Situação:"]
hodgenovice