I would like to scrape the French Wikipedia entry for the Vancouver Olympic Games. Unfortunately, the data I need is not in a clean table format.

I am trying to create a data frame with 2 columns: Nation and number of athletes.

At this point I have

library(XML)
library(RCurl)

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
country<-tablehead[31]

where country is

> country
[1] "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n Andorre (6)\n Argentine (7)\n Arménie (4)\n Australie (41)\n Autriche (82)\n Azerbaïdjan (2)\n Belgique (8)\n Bermudes (1)\n Biélorussie (50)\n Bosnie-Herzégovine (5)\n Brésil (5)\n Bulgarie (18)\n Canada (206)\n Chili (3)\n Chine (90)\n Chypre (2)\n Colombie (1)\n\n\n\n Corée du Nord (2)\n Corée du Sud (46)\n Croatie (18)\n Danemark (18)\n Espagne (18)\n Estonie (32)\n États-Unis (216)\n Éthiopie (1)\n Finlande (95)\n France (108)\n Géorgie (12)\n Ghana (1)\n Grande-Bretagne (52)\n Grèce (7)\n Hong Kong (1)\n Hongrie (16)\n Îles Caïmans (1)\n Inde (3)\n Iran (4)\n Irlande (6)\n Islande (4)\n\n\n\n Israël (3)\n Italie (109)\n Jamaïque (1)\n Japon (94)\n Kazakhstan (38)\n Kirghizistan (2)\n Lettonie (54)\n Liban (3)\n Liechtenstein (6)\n Lituanie (6)\n Macédoine (3)\n Moldavie (8)\n Maroc (1)\n Mexique (1)\n Monaco (3)\n Monténégro (1)\n Mongolie (2)\n Népal (1)\n Norvège (99)\n Nouvelle-Zélande (16)\n\n\n\n Ouzbékistan (3)\n Pakistan (1)\n Pays-Bas (34)\n Pérou (3)\n Pologne (50)\n Portugal (1)\n République tchèque (93)\n Roumanie (29)\n Russie (179)\n Saint-Marin (1)\n Sénégal (1)\n Serbie (10)\n Slovaquie (73)\n Slovénie (49)\n Suède (108)\n Suisse (146)\n Tadjikistan (1)\n Taipei chinois (1)\n Turquie (5)\n Ukraine (47)\n\n"

I have tried

library(stringr)
str_detect(country,"\n")
country<-str_split(country,"\n")

But the data are very dirty, and it's not working well. Any suggestions?

Andy Clifton
delaye

2 Answers

One possibility is to use regular expressions. I have never done this in R, but the stringr package seems to be recommended; see "Extract a regular expression match in R version 2.10" ( http://cran.r-project.org/web/packages/stringr/stringr.pdf )
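To make the regular-expression idea concrete, here is a minimal sketch on a single entry copied from the question's output. It uses base R's `regexec()` and `regmatches()` rather than stringr, so it runs without extra packages; the variable names `entry`, `nation` and `athletes` are only illustrative.

```r
# One entry copied from the question's output
entry <- " Afrique du Sud (2)"

# Group 1: the name before the parenthesis; group 2: the digits inside it
pattern <- "^\\s*(.*?)\\s*\\((\\d+)\\)\\s*$"
m <- regmatches(entry, regexec(pattern, entry, perl = TRUE))[[1]]

# m[1] is the full match, m[2] and m[3] are the two capture groups
nation   <- m[2]
athletes <- as.integer(m[3])
```

The same pattern can then be applied to every line once the string has been split on "\n".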

EDIT: Code that appears to work for me

library(XML)
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
library(stringr)

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding = "UTF-8")
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
country<-tablehead[31]

country<-strsplit(country,"\n")

# extract the country name: everything before the opening parenthesis, trimmed
bar <- function(x) str_trim(str_extract(x, "[^(]*"), side = "both")
res1 <- sapply(country[[1]], bar)
# extract the number of athletes: the text between the parentheses
foo <- function(x) str_trim(str_match(x, "\\((.*?)\\)")[[2]], side = "both")
res2 <- sapply(country[[1]], foo)
# build the data frame and drop the rows that came from empty lines
res2 <- as.numeric(res2)
df <- data.frame(res1, res2)
df <- df[res1 != "",]
# inspect the data frame
nrow(df)
summary(df)
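
For readers without stringr, the same split-then-extract steps can be sketched in base R. This is not the answer's code, only an equivalent under stated assumptions: `country` below is a shortened copy of the string from the question, and the column names `nation` and `athletes` are made up for the example.

```r
# A shortened copy of the string from the question (assumption: same layout)
country <- "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n\n\n\n Corée du Nord (2)\n États-Unis (216)\n\n"

# Split into lines, trim surrounding whitespace, and drop the empty lines
entries <- trimws(strsplit(country, "\n", fixed = TRUE)[[1]])
entries <- entries[nzchar(entries)]

# Name: the entry without its trailing "(n)"; count: the digits inside "(n)"
df <- data.frame(
  nation   = sub("\\s*\\(\\d+\\)$", "", entries),
  athletes = as.integer(sub("^.*\\((\\d+)\\)$", "\\1", entries)),
  stringsAsFactors = FALSE
)
```

Dropping the empty lines before extraction avoids the later `df[res1 != "",]` filtering step.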
etna
  • Hi etna, this looks like what I need! But at this point I get a data frame with 1 observation of 2 variables. – delaye Mar 27 '14 at 19:15
  • Strange. I have now posted the whole code, which appears to work here, and added some lines that inspect the result. – etna Mar 28 '14 at 07:53

Try

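library(plyr)
library(stringr)
# start from the original single string (tablehead[31]) and split it into lines
country <- str_split(country,"\n")[[1]]
# apply the extraction to every line, not just the first element
df <- ldply(country, function(z) data.frame(a = str_extract(z, "[A-Za-z]+"), b = str_extract(z, "[0-9]+")))
head(na.omit(df))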

                                  a                        b
2                           Afrique                        2
3                           Albanie                        1
4                               Alg                        1
5                         Allemagne                      153
6                           Andorre                        6
7                         Argentine                        7
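
The output above shows "Afrique" and "Alg" instead of the full names, because `[A-Za-z]+` stops at the first space or accented letter. A minimal sketch of one way around this, in base R with `perl = TRUE` (the names `ascii_only` and `full_name` are only illustrative), is to match everything before the opening parenthesis instead:

```r
# Two entries copied from the question's output
entries <- c(" Afrique du Sud (2)", " Algérie (1)")

# "[A-Za-z]+" stops at the first space or non-ASCII letter
ascii_only <- regmatches(entries, regexpr("[A-Za-z]+", entries, perl = TRUE))

# "[^(]+" keeps everything before the opening parenthesis; trim the padding
full_name <- trimws(regmatches(entries, regexpr("[^(]+", entries, perl = TRUE)))
```

The negated class does not depend on which letters count as alphabetic, so multi-word and accented names come through whole.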
sckott
  • Thanks Scott, but it doesn't work for me: `head(na.omit(df))` returns `...[1] str_extract.z....A.Za.z......1.. str_extract.z....0.9.... <0 rows> (or 0-length row.names)` – delaye Mar 27 '14 at 19:13
  • What does the object df look like? – sckott Mar 27 '14 at 19:23
  • `str_extract.z....A.Za.z......1.. str_extract.z....0.9.... 1 ` – delaye Mar 27 '14 at 19:27