Web scraping: extract a table from a page

Question

I have to extract the table with the columns "R.U.T" and "Entidad" from the page

http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554

I wrote the following code:

library(rvest)

# read the page
url <- "http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html"
page <- read_html(url)

# extract the table
table <- html_node(page, xpath = '//*[@id="listado_fiscalizados"]/table')
table <- html_table(table)

# convert the table to a data.frame
table <- data.frame(table)

but R shows me the following result:

> a
{xml_nodeset (0)}

That is, it does not recognize the table. Maybe it's because the table has hyperlinks?

If anyone knows how to extract the table, I would appreciate it. Many thanks in advance and sorry for my English.

It looks like the table is loaded with JavaScript, so you'll need to grab the HTML via RSelenium or the like. [Here's a recent example](http://stackoverflow.com/a/41497119/4497050) that you should be able to translate directly. — alistaire, Jan 10 '17 at 22:07
I knew about RSelenium, but I wanted to try another kind of solution. Thank you very much for your answer; if I do not find a different solution I will use RSelenium :) – user119144, Jan 11 '17 at 01:39

score 2 · Accepted Answer · answered Jan 11 '17 at 03:40

The page makes an XHR request to another resource, and the response to that request is what builds the table. You can read that resource directly:

library(rvest)
library(dplyr)

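# Request the resource that the page loads via XHR; it returns the table's HTML
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")

html_nodes(pg, "table") %>%
  html_table() %>%   # list with one data frame per <table>
  .[[1]] %>%         # keep the first table
  as_tibble() %>%    # replaces tbl_df(), which is deprecated in current dplyr
  select(1:2)        # keep the R.U.T. and Entidad columns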
## # A tibble: 36 × 2
##        R.U.T.                                            Entidad
##         <chr>                                              <chr>
## 1  99588060-1                           ACE SEGUROS DE VIDA S.A.
## 2  76511423-3                               ALEMANA SEGUROS S.A.
## 3  96917990-3                      BANCHILE SEGUROS DE VIDA S.A.
## 4  96933770-3                          BBVA SEGUROS DE VIDA S.A.
## 5  96573600-K                              BCI SEGUROS VIDA S.A.
## 6  96656410-5                 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7  96837630-6            BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8  76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9  76477116-8                            CF SEGUROS DE VIDA S.A.
## 10 99185000-7           CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows

You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
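
To make the difference concrete, here is a minimal offline sketch; it is not the answer's own code. Both HTML strings are made-up stand-ins (assumptions): `static_html` imitates what `read_html()` receives from the portal page before any JavaScript runs, and `xhr_html` imitates the response of the XHR resource. The real markup may differ.

```r
library(rvest)

# Hypothetical stand-in for the static HTML of the portal page: the
# container exists, but the table is only added later by JavaScript.
static_html <- '<html><body><div id="listado_fiscalizados"></div></body></html>'

# Hypothetical stand-in for the HTML returned by the XHR request.
xhr_html <- '<table>
  <tr><th>R.U.T.</th><th>Entidad</th></tr>
  <tr><td>99588060-1</td><td>ACE SEGUROS DE VIDA S.A.</td></tr>
  <tr><td>76511423-3</td><td>ALEMANA SEGUROS S.A.</td></tr>
</table>'

static_page <- read_html(static_html)
xhr_page <- read_html(xhr_html)

# The XPath from the question matches nothing in the static HTML.
missing_tables <- html_nodes(static_page, xpath = '//*[@id="listado_fiscalizados"]/table')
length(missing_tables)

# The XHR response carries the table itself, so html_table() can parse it.
entities <- html_table(html_nodes(xhr_page, "table"))[[1]]
entities
```

The first lookup returns an empty node set, which matches the empty result reported in the question, while the second yields a two-column table. This is why pointing `read_html()` at the XHR URL works without a browser.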

This is the solution I was looking for. I changed the URL and XPath in my code and it works. Thank you very much. One question: how did you know the table came from another resource? – user119144, Jan 11 '17 at 04:06
"You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.". It's worth the effort to poke at browser "Inspect" / "Inspect Element" / "Developer Tools". Tons of good stuff under the covers of most web pages. — hrbrmstr, Jan 11 '17 at 04:09

score 1 · Answer 2 · answered Jan 10 '17 at 22:16

This is the answer using RSelenium:

library(RSelenium)
library(XML)

# Start Selenium Server
# (checkForServer() and startServer() belong to older RSelenium releases)
RSelenium::checkForServer(beta = TRUE)
selServ <- RSelenium::startServer(javaargs = c("-Dwebdriver.gecko.driver=\"C:/Users/Mislav/Documents/geckodriver.exe\""))
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
remDr$open() # silent = TRUE
Sys.sleep(2)

# Open the page and give the JavaScript time to build the table
remDr$navigate("http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html")
Sys.sleep(2)
doc <- htmlParse(remDr$getPageSource()[[1]], encoding = "UTF-8")

# Close the browser and stop the server
remDr$close()
selServ$stop()

# Read every table in the rendered page into a list of data frames
tables <- readHTMLTable(doc)
head(tables)
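
Since `getPageSource()` returns the rendered page as a character string, that string can presumably also be handed to `rvest`, which would let the selector from the question be reused. The sketch below runs offline: `page_source` is a made-up stand-in for `remDr$getPageSource()[[1]]`, and it assumes the rendered table sits directly inside the `listado_fiscalizados` container, which the real page may not do.

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource()[[1]]: the page after
# JavaScript has inserted the table into its container.
page_source <- '<html><body><div id="listado_fiscalizados"><table>
  <tr><th>R.U.T.</th><th>Entidad</th></tr>
  <tr><td>96917990-3</td><td>BANCHILE SEGUROS DE VIDA S.A.</td></tr>
</table></div></body></html>'

# Parse the rendered source and apply the XPath from the question.
rendered <- read_html(page_source)
table_node <- html_node(rendered, xpath = '//*[@id="listado_fiscalizados"]/table')

# Convert the matched <table> node into a data frame.
fiscalizados <- html_table(table_node)
fiscalizados
```

With this approach only `RSelenium` and `rvest` are needed, and the `XML` package can be left out.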

answered Jan 10 '17 at 22:16 by Mislav

You need to show what packages you're loading at the top; it looks like `XML` as well as `RSelenium`. – alistaire Jan 10 '17 at 23:02
Thank you very much for your answer, this works :D. I will still keep looking for a solution without RSelenium, though. – user119144 Jan 11 '17 at 01:57
