I am using the rvest package to extract information from a French-language website that contains accented characters.

I've tried different encodings in my read_html() call (latin1, latin8, utf-8), but all of them failed.

At the top of the page source:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

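Here is my code:

library(rvest)  # also provides the %>% pipe
# `url` holds the address of the page being scraped (not shown here)
dnc_avis <- read_html(url, encoding = "utf8")
df <- data.frame(dnc_avis %>% html_nodes("div .contenant_recherche h3") %>% html_text(trim = TRUE))
df[1, ]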

The output shows the name with a garbled "é" instead of the expected Monsieur René.

I also tried:

dnc_avis <- read_html(iconv(url, to = "UTF-8"), encoding = "utf8")

but got the same output.

How can I get the correct encoding?

Thanks a lot.

Ouriel
  • Are you using an old version of rvest? Try updating it. The current function for parsing an html file is `html()`. Check `?stri_enc_detect` for auto detecting the encoding. That might help. If you provide the actual URL I could give it a shot as well. – Martin Schmelzer Oct 08 '15 at 11:20
  • When I use your code without specifying any encoding my output is fine: `[1] Monsieur René AUGER`. Try that please. If that also doesn't work (and you updated rvest) then it might be a problem related to your OS and its locale settings... – Martin Schmelzer Oct 08 '15 at 12:05
  • Weird... I'm using RStudio 0.99.484 on R i386 3.1.3 and rvest 0.3.0 – Ouriel Oct 08 '15 at 12:53
  • And what operating system? Win, OSX, Unix? – Martin Schmelzer Oct 08 '15 at 13:05
  • Windows 7 Pro (in French), SP1, 64-bit – Ouriel Oct 08 '15 at 13:13
  • If you are not limited to using R, there is a post here about how to solve encoding problems in web scraping: http://www.datascraping.co/doc/questions/21/encoding-problem-in-website-scraping – Vikash Rathee Dec 08 '15 at 11:18
  • I think the same question was answered here: http://stackoverflow.com/questions/29379771/utf-8-encoding-problems-with-r – Oliver Dec 18 '15 at 10:17
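
The locale hint in the comments can be sketched in base R. This is only an illustration, assuming the garbling is the common case where the page's UTF-8 bytes end up interpreted as latin1, which would fit a French Windows locale but is not confirmed by the question. The helper name `repair_mojibake` is hypothetical and is not part of rvest.

```r
# Inspect the session's character-type locale. On a French Windows 7 setup
# this is typically a Windows code page rather than UTF-8 (assumption).
Sys.getlocale("LC_CTYPE")
l10n_info()

# Reproduce the suspected problem: take the UTF-8 bytes of "René"
# (the "é" is stored as the two bytes c3 a9) and misdeclare them as latin1.
original <- "Ren\u00e9"
bytes <- charToRaw(original)
garbled <- rawToChar(bytes)
Encoding(garbled) <- "latin1"
garbled <- enc2utf8(garbled)  # the "é" has now become "Ã" followed by "©"

# Hypothetical helper: undo the misinterpretation by converting back to the
# latin1 bytes and declaring those bytes as UTF-8. iconv() returns NA for
# strings containing characters that latin1 cannot represent, so this only
# fits text that was garbled in exactly this way.
repair_mojibake <- function(x) {
  y <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(y) <- "UTF-8"
  y
}

fixed <- repair_mojibake(garbled)
print(fixed)
```

Applied to the question's code, the character vector returned by html_text() would be passed through the helper before building the data frame. If the text is not garbled in this particular way, the helper will not help and may return NA.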
