At work I'd like to parse some web pages. Unfortunately I can't include a real page in my example because the URLs at work are confidential, so I can only try to explain what the problem is.

To parse the pages I wrote the following script in R, using www.imdb.com as a mock URL:

library(rvest)
library(plyr)

# urls

url <- "http://www.imdb.com/"

# parse

html <- try(read_html(url))

# select

select_meta <- function(html) {
  html %>%
    html_nodes(xpath = "//div") %>%
    html_attrs()  # attributes of every <div>, returned as a list
}

meta <- select_meta(html)

The problem is that this script doesn't return anything for the pages I use at work. I guess this is because their content is generated by JavaScript. I found this tutorial, which explains how to scrape JavaScript-generated pages in R.

The code the tutorial uses to render and save the page is the following:

// scrape_techstars.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'techstars.html';

page.open('http://www.techstars.com/companies/stats/', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

I don't have any JavaScript knowledge, so I'm having trouble scaling page.open (which only handles one page) to multiple pages (at work I have to parse roughly 100 pages). Instead of relying on PhantomJS, I'd rather have a solution that is completely R based (if this is totally inefficient and offensive to real coders, I apologise in advance). So the crux of my question is: "How can I render and save several JavaScript-generated pages from R?"
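To make the goal concrete, here is a minimal sketch of the kind of loop I have in mind. It is not a tested solution and it still needs PhantomJS, which I assume is installed and on the PATH. R writes a small script per URL, modelled on the tutorial's scrape_techstars.js, and runs it with system2(). The function names make_phantom_script and render_page are my own hypothetical helpers.

```r
# Build the text of a PhantomJS script that saves one rendered page to disk.
# Only the URL and the output path change between pages.
make_phantom_script <- function(url, out_file) {
  sprintf(
    paste(
      "var page = require('webpage').create();",
      "var fs = require('fs');",
      "page.open(%s, function (status) {",
      "  if (status === 'success') {",
      "    fs.write(%s, page.content, 'w');",
      "  }",
      "  phantom.exit();",
      "});",
      sep = "\n"
    ),
    # encodeString() quotes the value and escapes quotes and backslashes,
    # so the result can be used as a JavaScript string literal.
    encodeString(url, quote = "'"),
    encodeString(out_file, quote = "'")
  )
}

# Render one URL with PhantomJS and return the parsed document,
# or NULL if no HTML file was produced.
# Assumption: the `phantomjs` executable can be found on the PATH.
render_page <- function(url, phantomjs = "phantomjs") {
  js_file   <- tempfile(fileext = ".js")
  html_file <- tempfile(fileext = ".html")
  on.exit(unlink(c(js_file, html_file)))

  writeLines(make_phantom_script(url, html_file), js_file)
  system2(phantomjs, shQuote(js_file))

  if (!file.exists(html_file)) return(NULL)
  rvest::read_html(html_file)
}

# Intended usage (commented out because it needs PhantomJS and network access):
# urls  <- c("http://www.imdb.com/", "http://www.techstars.com/companies/stats/")
# pages <- lapply(urls, render_page)
# meta  <- lapply(Filter(Negate(is.null), pages), select_meta)
```

With roughly 100 URLs, the lapply() call would render them one at a time. Pages that load their content slowly might also need a delay before page.content is read, which this sketch does not handle.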

This is a one-off task, so I'd rather not have to read up on JavaScript or parsing. Thanks in advance for helping me out.

1053Inator
  • [This answer](http://stackoverflow.com/a/26681840/1816580) contains some examples of how this can be achieved. If this solves your problem, you can upvote the answer (so I can vote to close as duplicate this and future questions). If you're looking for an R solution, then you should clarify that in your question (and I won't vote to close it). – Artjom B. Nov 27 '15 at 13:16
  • That answer seems to be the solution to my problem but sadly enough I don't understand it due to lack of javascript knowledge. I should have made it clear I was looking for an R solution so I edited the title and question. – 1053Inator Nov 27 '15 at 13:35

0 Answers