I would like to scrape the table from this URL: https://www.promet.si/portal/sl/stevci-prometa.aspx. Can someone help or suggest an approach?

I tried following instructions and using the packages rvest, httr and html, but had no success with this particular site. Thank you.

user8795501
  • Are you sure you're allowed to scrape that page? Could you post your attempts? – s__ Oct 17 '18 at 13:40
  • So, what is the problem? What is your code, and what errors did you get? – Vladimir Volokhonsky Oct 17 '18 at 13:52
  • @s_t [`robots.txt`](https://www.promet.si/robots.txt) _seems_ to allow it, but I can't read any terms of service to know for sure. That site uses a relatively up-to-date SharePoint back end, which severely obfuscates how the dynamic page resources are loaded and displayed. You will likely have to use splashr or RSelenium & friends to get the content – hrbrmstr Oct 17 '18 at 15:39
  • Wow. That site is truly evil. The XHR `POST` sends a base64-encoded value from a computed SharePoint viewstate, and the response is binary content that custom JavaScript decodes. You will definitely want to use splashr or RSelenium, and make sure to wait a bit on the page and possibly virtually move the mouse, as I believe there's some JavaScript that checks for a human. – hrbrmstr Oct 17 '18 at 15:48
  • Also, don't leave that site up in your browser. It has a few eavesdrop scripts and it tries to refresh that table every minute or so, pulling in over 1 MB each time. However, https://www.promet.si/portal/sl/etd.aspx says they have APIs which you may also want to investigate. – hrbrmstr Oct 17 '18 at 15:48
  • @hrbrmstr, I had read it, but I am still learning how to interpret it (your comment helps), so my comment was meant not only as advice but also as a genuine question, which you have answered. Thanks a lot. I had thought that reading `robots.txt` was necessary and sufficient to tell whether a site is "scrapable", but your advice to also look for an explicit policy is good. – s__ Oct 17 '18 at 20:59

2 Answers

This ought to help get you started:

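library(RSelenium)
library(wdman)         # starts and manages a local Selenium server
library(seleniumPipes) # pipe-friendly WebDriver client (remoteDr, go, getPageSource)
library(rvest)
library(tidyverse)

# Start a Selenium server; wdman's default port is 4567
selServ <- selenium(verbose = FALSE)
selServ$log() # check the log to confirm the port the server is listening on
remDr <- remoteDr(browserName = "chrome", port = 4567L)

# Load the page in a real browser so its JavaScript can build the table
remDr %>% 
  go("https://www.promet.si/portal/sl/stevci-prometa.aspx")

Sys.sleep(5) # give the dynamic content time to load

pg <- getPageSource(remDr) # rendered HTML as an xml_document

html_node(pg, xpath=".//div[@id='ctl00_mainContent_ctl00_StvContainer']/table") %>% 
  html_table() %>% 
  as_tibble() # tbl_df() is deprecated in current dplyr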
## # A tibble: 1,239 x 10
##    X1    X2            X3     X4                       X5     X6      X7     X8    X9     X10  
##    <lgl> <chr>         <chr>  <chr>                    <chr>  <chr>   <chr>  <chr> <chr>  <lgl>
##  1 NA    Lokacija      Cesta  Smer                     Pas    Števil… Hitro… Razm… Stanje NA   
##  2 NA    Ajdovščina    R2-444 vzhod - zahod            ""     60      64     81,7  Norma… NA   
##  3 NA    Ajdovščina    R2-444 zahod - vzhod            ""     12      62     371,6 Norma… NA   
##  4 NA    Ajdovščina 2  R2-444 Ajdovščina - Selo        ""     36      67     117,8 Norma… NA   
##  5 NA    Ajdovščina 2  R2-444 Ajdovščina - Selo        ""     12      60     787,1 Norma… NA   
##  6 NA    Ajdovščina AC HC-H4  Nova Gorica - Vipava     vozni  96      100    31,5  Norma… NA   
##  7 NA    Ajdovščina AC HC-H4  Nova Gorica - Vipava     prehi… 36      124    120,7 Norma… NA   
##  8 NA    Ankaran       R2-406 Križ. Moretini - Ankaran ""     96      59     29    Norma… NA   
##  9 NA    Ankaran       R2-406 Ankaran - Križ. Moretini ""     12      57     292,1 Norma… NA   
## 10 NA    Apače         R2-438 Trate - Gornja Radgona   ""     24      58     110,6 Norma… NA   
## # ... with 1,229 more rows
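
The output above still has the real header text sitting in row 1 and empty columns at both edges, and its numeric fields use decimal commas. As a rough follow-up sketch (not part of the original answer), the base R code below shows one way to tidy a table shaped like that. The toy `raw` data frame, the helper names `clean_table()` and `parse_decimal_comma()`, and the `value` column name are all made up for illustration; the real column names are truncated in the printout, so they are not reproduced here.

```r
# Toy stand-in for the scraped table: header text in row 1, all-NA edge
# columns, decimal commas in numeric fields. The "value" column name is
# hypothetical; it is not taken from the site.
raw <- data.frame(
  X1 = c(NA, NA, NA),
  X2 = c("Lokacija", "Ajdovščina", "Ankaran"),
  X3 = c("Cesta", "R2-444", "R2-406"),
  X4 = c("value", "81,7", "29"),
  X5 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop empty columns and promote row 1 to column names.
clean_table <- function(df) {
  # Keep only columns that contain at least one non-NA value.
  df <- df[, colSums(!is.na(df)) > 0, drop = FALSE]
  # Use the first row as the column names, then remove that row.
  names(df) <- as.character(unlist(df[1, ]))
  df <- df[-1, , drop = FALSE]
  rownames(df) <- NULL
  df
}

# Hypothetical helper: convert strings such as "81,7" to numeric 81.7.
parse_decimal_comma <- function(x) {
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

tidy <- clean_table(raw)
tidy$value <- parse_decimal_comma(tidy$value)
print(tidy)
```

On the real table the same idea would apply to the result of `html_table()`, with the numeric conversion applied to whichever columns actually hold numbers.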
hrbrmstr

Translated, the site's terms of use read: "Right to use: All information and images contained on the website www.promet.si are subject to copyright protection and other forms of intellectual property protection. The documents published on these web pages may only be reproduced for non-commercial purposes, and they must also retain all the warnings of copyright or other rights. On every reproduction, the "Traffic Information Center for State Roads" should be listed as a source."

I am not sure whether that means scraping for non-commercial purposes is allowed.

Anyway, thank you for the warning, @s_t, and special thanks to @hrbrmstr for the answer with the nice code.

user8795501