
I scrape some information from a website like this:

library(rvest)

# Download and parse the page
x <- "http://www.transfermarkt.com/wasserman-media-group/beraterfirmenuebersicht/berater?sort=gesamtwert.desc"
web <- read_html(x)

# Player names are the text of the elements with class "spielprofil_tooltip"
footballers <- web %>%
  html_nodes(".spielprofil_tooltip") %>%
  html_text()
footballers

The output looks like this:

 [1] "Cristiano Ronaldo"    "James RodrĂ­guez"     "Ăngel Di MarĂ­a"     "Diego Costa"          "William Carvalho"     "Eliaquim Mangala"     "Ezequiel Garay"       "Falcao"              
 [9] "Thiago Silva"         "André Gomes"         "João Moutinho"       "Carlos Vela"          "Bernardo Silva"       "Fábio Coentrão"     "Pepe"                 "Ivan Cavaleiro"      
[17] "Giovani dos Santos"   "Miguel Veloso"        "Danilo"               "Pizzi"                "Anwar El Ghazi"       "Rúben Neves"         "Adrián López"       "Ahmed Hassan"        
[25] "Danny"                "Nélson Oliveira"     "Ricardo Quaresma"     "Gonçalo Guedes"      "Nélson Semedo"       "Wallace"              "Anderson"             "Bruno Gama"          
[33] "Sidnei"               "Hugo Viana"           "Hélder Costa"        "Tiago"                "Bruno Alves"          "Bebé"                "José Sá"            "Hélder Postiga"     
[41] "Simão"               "José Bosingwa"       "Ederson"              "Duda"                 "André Geraldes"      "Pelé"                "Filipe Oliveira"      "Diogo Jota"          
[49] "Burgui"               "Edinho"               "Alberto Rodríguez"   "Moreno"               "Ricardo Carvalho"     "Tiago Sá"            "Vítor Gomes"         "Mário Sérgio"      
[57] "Rafael Márquez"      "Júlio Alves"         "Marcão"              "Cândido Costa"       "Diego Oliveira"       "Rafa"                 "Valdir"               "César Peixoto"      
[65] "Ricardo Carvalho"     "Jorge Ribeiro"        "Lucas Ferrugem"       "Nunes"                "Pedrinha"             "Dong-Hyun Kim"        "Wênio"               "Henrique Hilário"   
[73] "Jorge Andrade"        "Derlei"               "Abel"                 "Petit"                "Costinha"             "Nuno Espírito Santo" "Paulo Ferreira"       "Fábio Faria"        
[81] "Deco"                 "Jorge LuĂ­s"          "JoĂŁo Alves"          "Fabiano Rossato"      "Mantorras"            "Bruno"                "Bruno Tiago"          "LuĂ­s Loureiro"      
[89] "Xadas"                "VitĂł"    

As you can see, there is a problem with the encoding, so I tried the following statement:

repair_encoding(footballers)
Best guess: UTF-8 (100% confident)
 [1] "Cristiano Ronaldo"   "James Rodríguez"     "Ángel Di María"      "Diego Costa"         "William Carvalho"    "Eliaquim Mangala"    "Ezequiel Garay"      "Falcao"             
 [9] "Thiago Silva"        "André Gomes"         "Jo\032o Moutinho"    "Carlos Vela"         "Bernardo Silva"      "Fábio Coentr\032o"   "Pepe"                "Ivan Cavaleiro"     
[17] "Giovani dos Santos"  "Miguel Veloso"       "Danilo"              "Pizzi"               "Anwar El Ghazi"      "Rúben Neves"         "Adrián López"        "Ahmed Hassan"       
[25] "Danny"               "Nélson Oliveira"     "Ricardo Quaresma"    "Gonçalo Guedes"      "Nélson Semedo"       "Wallace"             "Anderson"            "Bruno Gama"         
[33] "Sidnei"              "Hugo Viana"          "Hélder Costa"        "Tiago"               "Bruno Alves"         "Bebé"                "José Sá"             "Hélder Postiga"     
[41] "Sim\032o"            "José Bosingwa"       "Ederson"             "Duda"                "André Geraldes"      "Pelé"                "Filipe Oliveira"     "Diogo Jota"         
[49] "Burgui"              "Edinho"              "Alberto Rodríguez"   "Moreno"              "Ricardo Carvalho"    "Tiago Sá"            "Vítor Gomes"         "Mário Sérgio"       
[57] "Rafael Márquez"      "Júlio Alves"         "Marc\032o"           "Cândido Costa"       "Diego Oliveira"      "Rafa"                "Valdir"              "César Peixoto"      
[65] "Ricardo Carvalho"    "Jorge Ribeiro"       "Lucas Ferrugem"      "Nunes"               "Pedrinha"            "Dong-Hyun Kim"       "W\032nio"            "Henrique Hilário"   
[73] "Jorge Andrade"       "Derlei"              "Abel"                "Petit"               "Costinha"            "Nuno Espírito Santo" "Paulo Ferreira"      "Fábio Faria"        
[81] "Deco"                "Jorge Luís"          "Jo\032o Alves"       "Fabiano Rossato"     "Mantorras"           "Bruno"               "Bruno Tiago"         "Luís Loureiro"      
[89] "Xadas"               "Vitó" 

Some names were repaired correctly, but others were not: certain accented characters (for example in João, Simão, Marcão and Wênio) now appear as \032. Does anybody know how to handle the encoding properly in R? I ran into a similar issue when dealing with Polish characters.

Any help would be appreciated!
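
One way to narrow this down is to check whether the strings themselves are damaged or whether only the printing is. The following is a minimal sketch using base R only; `name` is a hypothetical stand-in for a single scraped element, not data taken from the site.

```r
# Hypothetical stand-in for one scraped name, built from code points so the
# example does not depend on the encoding of this script.
name <- intToUtf8(c(0x4A, 0x6F, 0xE3, 0x6F))  # J, o, U+00E3, o

# Encoding mark attached to the string.
Encoding(name)

# Whether the stored bytes form valid UTF-8, independent of the mark.
validUTF8(name)

# Code points actually stored; U+00E3 is 227 in decimal.
utf8ToInt(name)

# Native encoding of the current session; console output goes through it.
l10n_info()
Sys.getlocale("LC_CTYPE")
```

If the code points are the expected ones but the console still shows garbage, the data is probably intact and the problem lies in displaying or converting it to the session's native encoding.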

Marcin
Michał
  • For me on `http://www.transfermarkt.com/wasserman-media-group/beraterfirmenuebersicht/berater?sort=gesamtwert.desc` there are no players... – Rentrop Feb 20 '16 at 00:01
  • http://stackoverflow.com/questions/32833894/r-rvest-is-not-proper-utf-8-indicate-encoding?rq=1 – Benjamin Feb 20 '16 at 00:38
  • @Floo0 You are right. The proper site would be, for example: http://www.transfermarkt.com/wasserman-media-group/beraterfirmenuebersicht/berater?sort=gesamtwert.desc – Michał Feb 22 '16 at 10:59
  • @Benjamin changing encoding to utf-8 (as the source of website suggests) does not fix the problem. – Michał Feb 22 '16 at 11:06
  • I should also say that after using the repair_encoding function I got a warning message: In stringi::stri_conv(x, from = from) : the Unicode codepoint \U000000e3 cannot be converted to destination encoding – Michał Feb 23 '16 at 14:06
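
The warning quoted in the last comment names U+00E3, which is the code point of ã. The sketch below illustrates one possible explanation. It rests on an assumption the question does not confirm: that the R session's native encoding is Windows-1250 (CP1250), as it would be on Windows with a Polish locale. The variables `jose` and `joao` are hypothetical examples, and only base R's `iconv()` is used.

```r
# Names built from code points, so the script's own encoding does not matter.
jose <- intToUtf8(c(0x4A, 0x6F, 0x73, 0xE9))  # "Jos" + U+00E9 (e with acute)
joao <- intToUtf8(c(0x4A, 0x6F, 0xE3, 0x6F))  # "Jo" + U+00E3 (a with tilde) + "o"

# UTF-8 stores each accented letter as two bytes (c3 a9 and c3 a3).
charToRaw(jose)
charToRaw(joao)

# ASSUMPTION: the native encoding is CP1250. Reading the UTF-8 bytes as if
# they were CP1250 turns each accented letter into two unrelated characters,
# which is the kind of garbling seen in the first output.
garbled <- iconv(c(jose, joao), from = "CP1250", to = "UTF-8")
garbled

# Converting the other way needs a CP1250 slot for every character.
# CP1250 has one for U+00E9 but none for U+00E3, so the second element is
# expected to come back as NA (some iconv builds substitute a placeholder).
native <- iconv(c(jose, joao), from = "UTF-8", to = "CP1250")
is.na(native)
```

If the assumption holds, the scraped names are intact as UTF-8 and the loss happens only when they are converted to the native encoding. That would explain why é, í and ó survive the repair while ã and ê do not. Keeping the strings in UTF-8 and writing them out with an explicit `fileEncoding = "UTF-8"` may avoid the lossy conversion.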
