Recently, starting from this very useful question (Scraping html tables into R data frames using the XML package), I successfully used the XML package to scrape HTML tables.
Now I am trying to extract JavaScript-generated tables from here: Tables 2013 (then click on "Sortare alfabetică"). I am interested in exporting the first 9 columns of, say, the pag.1-pag.10 data.
I went through different related questions on the forum, including some where it was suggested not to use R for such a task, and a similar question that, however, did not prove directly useful for my problem. As suggested there, I have been reading up on the relenium package (see the developers' toy example here).
According to the structure of the website where the tables of interest are located, I have to click a first button to access the tables sorted by name, and then click a second button to navigate through all the subsequent tables I want to export. In practice I have to:
1. click the Sortare alfabetică button
2. copy the first 9 columns of the 10-row table
3. click the right-hand button (called Pagina urmatoare)
and then repeat steps 2-3 ten times (a loop skeleton of this plan follows the XPaths below).
Using the Chrome inspector (Tools > Developer tools), I found the following XPaths for the two buttons:
/html/body/table/tbody/tr[1]/td/table[2]/tbody/tr[2]/td/table/tbody/tr/td[2]/a
/html/body/table/tbody/tr[1]/td/table[3]/tbody/tr/td[4]/table
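In code, the plan would look roughly like this (a sketch only: scrapeCurrentPage() is a hypothetical placeholder for whatever cell-extraction method turns out to work, and firefox is the relenium browser object created below):

sortXPath <- "/html/body/table/tbody/tr[1]/td/table[2]/tbody/tr[2]/td/table/tbody/tr/td[2]/a"
nextXPath <- "/html/body/table/tbody/tr[1]/td/table[3]/tbody/tr/td[4]/table"

firefox$findElementByXPath(sortXPath)$click()                 # step 1: sort alphabetically
pages <- vector("list", 10)
for (i in 1:10) {
  pages[[i]] <- scrapeCurrentPage(firefox)                    # step 2: hypothetical scraper
  if (i < 10) firefox$findElementByXPath(nextXPath)$click()   # step 3: Pagina urmatoare
}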
I started with this code in order to accomplish step 1:
library(relenium)

firefox <- firefoxClass$new()   # start a Firefox session driven from R
firefox$get("http://bacalaureat.edu.ro/2013/rapoarte/rezultate/index.html")

# locate the "Sortare alfabetică" button by its XPath and click it
buttonElement <- firefox$findElementByXPath("/html/body/table/tbody/tr[1]/td/table[2]/tbody/tr[2]/td/table/tbody/tr/td[2]/a")
buttonElement$click()
But I get the following error:
[1] "Error: NoSuchElementException"
[1] "Thrown by Firefox$findElement(By by) and webElement$findElement(By by)."
I don't know whether there is an easier way to proceed, but an alternative to step 3 for navigating through pag.1-pag.10 could be to work with the dropdown menu on the webpage. The XPaths for pag.1 and pag.2 are:
//*[@id="PageNavigator"]/option[1]
//*[@id="PageNavigator"]/option[2]
Focusing on scraping data from a single table
Clearly, even before being able to navigate through the 10 tables via the buttons or the dropdown menu, the crucial problem is extracting the data contained in each table.
With this code I tried to focus on extracting the first 9 columns of the first table only (the code could then be iterated over "http://bacalaureat.edu.ro/.../page_2.html", "http://bacalaureat.edu.ro/.../page_3.html", etc.):
library(XML)
library(relenium)

firefox <- firefoxClass$new()
firefox$get("http://bacalaureat.edu.ro/2013/rapoarte/rezultate/alfabetic/page_1.html")

# grab the rendered source from the browser and parse every HTML table in it
doc <- htmlParse(firefox$getPageSource())
tables <- readHTMLTable(doc, stringsAsFactors=FALSE)
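For what it is worth, once a single page parses cleanly, the iteration over the ten pages that I have in mind would be roughly the following (a sketch: the page_1.html, ..., page_10.html URL pattern is inferred from the address bar, and I am assuming readHTMLTable names its result list by table id, here "mainTable"; if the element comes back unnamed, the right table would have to be picked out by inspection):

pages <- vector("list", 10)
for (i in 1:10) {
  url <- sprintf("http://bacalaureat.edu.ro/2013/rapoarte/rezultate/alfabetic/page_%d.html", i)
  firefox$get(url)
  doc <- htmlParse(firefox$getPageSource())
  tabs <- readHTMLTable(doc, stringsAsFactors=FALSE)
  pages[[i]] <- tabs[["mainTable"]][, 1:9]   # keep the first 9 columns
}
bac <- do.call(rbind, pages)                 # stack the ten pages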
Unfortunately, the output of readHTMLTable is extremely messy. I don't know if this makes sense, and I am only guessing, but it may be necessary to dig deeper into the JavaScript-generated page and extract the information in the table cell by cell.
For instance, for the first individual, the 9 variable values of interest are characterized by the following XPaths:
//*[@id="mainTable"]/tbody/tr[3]/td[1]
//*[@id="mainTable"]/tbody/tr[3]/td[2]
//*[@id="mainTable"]/tbody/tr[3]/td[3]/a
//*[@id="mainTable"]/tbody/tr[3]/td[4]/a
//*[@id="mainTable"]/tbody/tr[3]/td[5]/a
//*[@id="mainTable"]/tbody/tr[3]/td[6]/a
//*[@id="mainTable"]/tbody/tr[3]/td[7]
//*[@id="mainTable"]/tbody/tr[3]/td[8]
//*[@id="mainTable"]/tbody/tr[3]/td[9]
Using these paths, the entries of each cell could be saved into an R vector, and the procedure could then be repeated for all the other individual-specific rows of data. Is it sensible to proceed like this? If so, how would you do it with relenium?
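Something along these lines is what I have in mind (only a sketch: I am assuming that relenium web elements expose Selenium's getText() method, that getText() on a td also returns the text of a nested a, and that the ten individuals of a page sit in tr[3] through tr[12]):

# extract the first 9 cells of one data row as a character vector;
# getText() is assumed to be available on relenium web elements
extractRow <- function(firefox, rowIndex) {
  sapply(1:9, function(j) {
    xpath <- sprintf('//*[@id="mainTable"]/tbody/tr[%d]/td[%d]', rowIndex, j)
    firefox$findElementByXPath(xpath)$getText()
  })
}

# rows 3-12 would cover the 10 individuals of one page
page1 <- t(sapply(3:12, function(i) extractRow(firefox, i)))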