What's the issue?
Your issue here is correctly determining the encoding of the webpage.
The good news
Your approach looks like a good one to me, since you looked at the source code and found the meta charset, given as ISO-8895-1. It is certainly ideal to be told the encoding rather than having to resort to guesswork.
The bad news
I don't believe that encoding exists. Firstly, when I search for it online, the results tend to look like typos. Secondly, R provides a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so passing it as an argument to read_html isn't useful. It would be nice if supplying an unsupported encoding threw a warning, but this doesn't seem to happen.
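You can check this directly; a minimal sketch (the exact contents of iconvlist() vary by platform):

# FALSE: the mistyped name is not a supported encoding
"ISO-8895-1" %in% iconvlist()
# TRUE on most platforms: the likely intended encoding is supported
"ISO-8859-1" %in% iconvlist()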
Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
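For example, a minimal sketch using the URL from the question:

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
# Override the page's mistyped declared charset with latin1
page <- read_html(url, encoding = "latin1")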
Tips on guessing an encoding
What is your browser doing?
When loading the page in a browser, you can see which encoding it is using to read the page. If the page looks right, that encoding seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).
Guessing with R
rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a URL, e.g. guess_encoding(url), or pass in phrases with more complex characters, e.g. guess_encoding("Situação do Termo/Convênio:").
One thing to note about this function is that it can only detect from a set of 30 of the more common encodings; there are many more possibilities.
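As a quick sketch of the phrase-based call (the exact candidates and confidence scores returned will vary):

library(rvest)
# Guess from a phrase containing accented characters taken from the page
guess_encoding("Situação do Termo/Convênio:")
# Returns a table of candidate encodings ranked by confidence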
As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see whether it renders as expected, we should end up with a shortlist of possible encodings (and rule many encodings out).
Sample code can be found at the bottom of this answer.
Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.
The page URL contains a .br extension, indicating it's Brazilian, and, according to Wikipedia, this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.
Code
Sample code for 'Guessing with R' point 2 (using iconvlist()):
library(rvest)

url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {
  # Optional print statement to show progress (from 0 to 1), since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))
  # Treat both errors and warnings as failed reads, returning NA
  tryCatch(expr = read_html(url, encoding = encoding_attempt),
           error = function(condition) NA,
           warning = function(condition) NA)
})
names(read_page) <- unique(iconvlist())

# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page) {
  # Failed reads are NA rather than parsed documents; skip them
  if (inherits(encoded_page, "xml_document")) {
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt"))
  } else {
    NA_character_
  }
})

# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[
  vapply(read_phrase, function(x) identical(x, "Situação:"), logical(1))
]