
I want to extract data from an 'aspx' page (I'm not a specialist in web page formats): http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx

More precisely, I want to extract the information for each boat, which is accessed by clicking the 'information' button on the left of each row.

My problem is that the URL of the 'aspx' page never changes, so I don't see how I can access the information for each boat.

I know how to extract data from a 'standard' web page, so how do I need to modify the following code? (These pages display similar, but more limited, information on boats than the 'aspx' page.)

library(rvest)

Url <- "http://www.ffvoile.fr/ffv/public/Application1/Habitable/HN_Detail.asp?Matricule=1"

Page <- read_html(Url)

Data <- Page %>%
    html_nodes(".Valeur") %>% # I used SelectorGadget to highlight the relevant elements
    html_text()

print(Data)
– Kumpelka
  • It doesn't matter what the extension of the page is in the URL. They all serve up HTML. You are most likely dealing with a page that updates its contents with `javascript`. You're going to need to use something that can run javascript, like `RSelenium`. Maybe this: https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/ or this: https://www.r-bloggers.com/scraping-javascript-rendered-web-content-using-r/ can help – MrFlick Mar 06 '18 at 22:07
  • You might consider using Fiddler to see what the HTTP requests are when you are browsing the website, then use httr::GET or httr::POST to do the equivalent. Let me know if you are able to get it – chinsoon12 Mar 07 '18 at 02:17
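As a rough sketch of the RSelenium route suggested in the first comment (a different approach from the answer below), something like the following could drive a real browser so the JavaScript-driven postbacks run. It assumes a local browser driver can be started with rsDriver(), and the XPath for the 'information' button is only a placeholder to check against the real page.

library(RSelenium)
library(rvest)

# start a local browser session (requires a driver such as geckodriver)
driver <- rsDriver(browser = "firefox", port = 4545L)
remote <- driver$client

remote$navigate("http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx")

# click the first 'information' button; the XPath below is a guess,
# inspect the page to find the real locator
btn <- remote$findElement(using = "xpath", "(//input[@type='image'])[1]")
btn$clickElement()
Sys.sleep(2) # crude wait for the postback to finish

page <- read_html(remote$getPageSource()[[1]])
# then extract fields with html_nodes()/html_text() as in the question

remote$close()
driver$server$stop()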

1 Answer


Assuming it is not illegal to scrape data from this website, you might consider the following.

As mentioned in the comments, you can use Fiddler to figure out which HTTP requests are being made and then replicate them.

library(httr)
library(xml2)

website <- "http://www.ffvoile.fr/ffv/web/pratique/habitable/OSIRIS/table.aspx"

# get cookies and the view-state fields
req <- GET(paste0(website, "/js"))
req_html <- read_html(rawToChar(req$content))
fields <- c("__VIEWSTATE","__VIEWSTATEGENERATOR","__VIEWSTATEENCRYPTED",
    "__PREVIOUSPAGE", "__EVENTVALIDATION")
viewheaders <- lapply(fields, function(x) {
    xml_attr(xml_find_first(req_html, paste0(".//input[@id='",x,"']")), "value")
})
names(viewheaders) <- fields

# POST the row-selection request; the row index i starts at 0 and can be looped over to visit each row
i <- 0
params <- c(viewheaders,
    list(
        "__EVENTTARGET"="ctl00$mainContentPlaceHolder$GridView_TH",
        "__EVENTARGUMENT"=paste0("Select$", i),
        "ctl00$mainContentPlaceHolder$DropDownList_classes"="TOUT",
        "ctl00$mainContentPlaceHolder$TextBox_Bateau"="",
        "ctl00$mainContentPlaceHolder$DropDownList_GR"="TOUT",
        "hiddenInputToUpdateATBuffer_CommonToolkitScripts"=1))
resp <- POST(website, body=params, encode="form", 
    set_cookies(structure(cookies(req)$value, names=cookies(req)$name)))
if(resp$status_code == 200) {
    writeLines(rawToChar(resp$content), "ffvoile.html")
    shell("ffvoile.html") # opens the saved page on Windows; use browseURL("ffvoile.html") elsewhere
}
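Rather than writing the response to disk and opening it in a browser, you could also parse it directly in R and loop over the row index, for example with a small helper like the hypothetical get_boat() below. Note that reusing the ".Valeur" selector from the question on the returned page is only an assumption; check the actual markup with SelectorGadget.

library(rvest)

# hypothetical helper: repeat the POST for row i and return the text of the
# ".Valeur" nodes (selector borrowed from the question, may need adjusting)
get_boat <- function(i) {
    params <- c(viewheaders,
        list(
            "__EVENTTARGET" = "ctl00$mainContentPlaceHolder$GridView_TH",
            "__EVENTARGUMENT" = paste0("Select$", i),
            "ctl00$mainContentPlaceHolder$DropDownList_classes" = "TOUT",
            "ctl00$mainContentPlaceHolder$TextBox_Bateau" = "",
            "ctl00$mainContentPlaceHolder$DropDownList_GR" = "TOUT",
            "hiddenInputToUpdateATBuffer_CommonToolkitScripts" = 1))
    resp <- POST(website, body = params, encode = "form",
        set_cookies(structure(cookies(req)$value, names = cookies(req)$name)))
    stop_for_status(resp)
    read_html(rawToChar(resp$content)) %>%
        html_nodes(".Valeur") %>%
        html_text()
}

boats <- lapply(0:4, get_boat) # first five rows; keep requests slow and polite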
– chinsoon12