At work I'd like to parse a number of web pages. Unfortunately I can't add a real page to my example because the URLs at work are confidential, so I can only try to explain the problem.
To parse the pages I wrote the following script in R, using www.imdb.com as a mock URL:
library(rvest)
library(plyr)

# url
url <- "http://www.imdb.com/"

# parse
html <- try(read_html(url))

# function to select meta
select_meta <- function(html) {
  html %>%
    html_nodes(xpath = "//div") %>%
    html_attrs()
}

meta <- select_meta(html)
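For the pages at work I'd then run the same function over all the URLs, roughly along these lines (the URLs below are just placeholders for my real ones):

# apply the same parser to several (static) pages
urls <- c("http://www.imdb.com/", "http://www.imdb.com/chart/top")  # placeholder URLs

meta_list <- lapply(urls, function(u) {
  html <- try(read_html(u))
  if (inherits(html, "try-error")) return(NULL)  # skip pages that fail to load
  select_meta(html)
})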
The problem is that this approach doesn't return anything for the pages I use at work. I guess this is because those pages are generated by JavaScript. I found this tutorial which explains how to scrape JavaScript-generated pages in R.
The code used to generate the page in the tutorial is the following:
// scrape_techstars.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'techstars.html';

page.open('http://www.techstars.com/companies/stats/', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
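If I understand the tutorial correctly, the saved file is then read back into R and parsed as before, so presumably something like:

# read the locally saved, JavaScript-rendered page back into R (assumed follow-up step)
html <- read_html("techstars.html")
meta <- select_meta(html)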
I don't have any JavaScript knowledge, so I'm having trouble scaling page.open (which only handles one page) to multiple pages; at work I have to parse roughly 100 pages. Instead of relying on PhantomJS I'd rather have a solution that is completely R based (if this is totally inefficient and offensive to real coders, I apologise in advance). So the crux of my question is: how can I generate several pages in R?
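The closest I've got is to reuse the tutorial script as a template and drive it from R with system(), roughly as in the sketch below (this assumes phantomjs is installed and on the PATH, and the file names are just ones I made up). But it still depends on PhantomJS and I have no idea whether it's a sensible way to do it:

# rough idea: write one small PhantomJS script per URL and call it from R
urls <- c("http://www.techstars.com/companies/stats/",
          "http://www.imdb.com/")                      # placeholders for my ~100 work URLs

js_template <- "var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function (status) {
  fs.write('%s', page.content, 'w');
  phantom.exit();
});"

scrape_one <- function(url, out_file) {
  writeLines(sprintf(js_template, url, out_file), "scrape_tmp.js")
  system("phantomjs scrape_tmp.js")   # render the page and save its html
  read_html(out_file)                 # parse the saved file as before
}

pages <- lapply(seq_along(urls), function(i) {
  scrape_one(urls[i], paste0("page_", i, ".html"))
})
meta_list <- lapply(pages, select_meta)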
This is a one-off thing, so I'm not really planning to read up on JavaScript or parsing in depth. Thanks in advance for helping me out.