I would like to scrape the Wikipedia entry for the Vancouver Olympic Games. Unfortunately, the data I need is not in a clean table format.
I am trying to create a data frame with two columns: nation and number of athletes.
This is what I have so far:
library(XML)
library(RCurl)
library(stringr)  # needed for str_detect() and str_split() used further down
path <- "https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
# Download the page and split it into lines
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)
# Extract the text of every table row
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
# Row 31 holds the list of participating nations
country <- tablehead[31]
where country is
> country
[1] "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n Andorre (6)\n Argentine (7)\n Arménie (4)\n Australie (41)\n Autriche (82)\n Azerbaïdjan (2)\n Belgique (8)\n Bermudes (1)\n Biélorussie (50)\n Bosnie-Herzégovine (5)\n Brésil (5)\n Bulgarie (18)\n Canada (206)\n Chili (3)\n Chine (90)\n Chypre (2)\n Colombie (1)\n\n\n\n Corée du Nord (2)\n Corée du Sud (46)\n Croatie (18)\n Danemark (18)\n Espagne (18)\n Estonie (32)\n États-Unis (216)\n Éthiopie (1)\n Finlande (95)\n France (108)\n Géorgie (12)\n Ghana (1)\n Grande-Bretagne (52)\n Grèce (7)\n Hong Kong (1)\n Hongrie (16)\n Îles Caïmans (1)\n Inde (3)\n Iran (4)\n Irlande (6)\n Islande (4)\n\n\n\n Israël (3)\n Italie (109)\n Jamaïque (1)\n Japon (94)\n Kazakhstan (38)\n Kirghizistan (2)\n Lettonie (54)\n Liban (3)\n Liechtenstein (6)\n Lituanie (6)\n Macédoine (3)\n Moldavie (8)\n Maroc (1)\n Mexique (1)\n Monaco (3)\n Monténégro (1)\n Mongolie (2)\n Népal (1)\n Norvège (99)\n Nouvelle-Zélande (16)\n\n\n\n Ouzbékistan (3)\n Pakistan (1)\n Pays-Bas (34)\n Pérou (3)\n Pologne (50)\n Portugal (1)\n République tchèque (93)\n Roumanie (29)\n Russie (179)\n Saint-Marin (1)\n Sénégal (1)\n Serbie (10)\n Slovaquie (73)\n Slovénie (49)\n Suède (108)\n Suisse (146)\n Tadjikistan (1)\n Taipei chinois (1)\n Turquie (5)\n Ukraine (47)\n\n"
I have tried
str_detect(country, "\n")
country <- str_split(country, "\n")
But the text is messy (stray whitespace and blank lines), and the split does not give me clean nation and athlete-count values. Any suggestions?
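One possible approach, as a minimal sketch in base R only: split the string on newlines, trim each piece, drop the empty ones, and then pull the nation and the count out of each "Nation (count)" entry with a regular expression. The sample string below is a shortened, hypothetical stand-in for `country`, and the names `entries`, `pattern`, and `athletes` are made up for illustration. It assumes every entry ends with a count in parentheses and is padded with ordinary spaces; if the real page uses non-breaking spaces, the trimming step would need adjusting.

```r
# Shortened stand-in for the `country` string shown above (assumption: same layout)
country <- "\n Afrique du Sud (2)\n Albanie (1)\n Allemagne (153)\n\n\n\n Canada (206)\n Hong Kong (1)\n\n"

# Split on newlines, trim surrounding whitespace, and drop empty pieces
entries <- trimws(strsplit(country, "\n", fixed = TRUE)[[1]])
entries <- entries[nzchar(entries)]

# Each entry is expected to look like "Nation (count)";
# group 1 captures the nation, group 2 the number inside the parentheses
pattern <- "^(.*[^ ]) *\\(([0-9]+)\\)$"

athletes <- data.frame(
  nation   = sub(pattern, "\\1", entries),
  athletes = as.integer(sub(pattern, "\\2", entries)),
  stringsAsFactors = FALSE
)

print(athletes)
```

The same steps should carry over to the full string. If some entries turn out not to match the pattern, `as.integer()` will return `NA` with a warning for those rows, which makes them easy to spot.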