I am using pjscrape to scrape content from dynamically generated pages on a site (see the code below). I can't figure out how to include the URL of the scraped page in the JSON that is written to the output file. I have tried several approaches, including document.url (see the commented-out urlFound lines at the top of the scraper below), but none of them gives urlFound the right value. The answer might be dead simple, but it's eluding me. Is there another way of doing this?

var scraper = function() {
    return {
        //urlFound:$(window.location.href),
        //urlFound: $(this).window.location.href,
        //urlFound: _pjs.toFullUrl($(this).attr('href')),
        //urlFound: _pjs.toFullUrl($(this).URL),
        // Heck - how to print out the url being scraped???
        name: $('h1').text(),
        marin: _pjs.getText($("script:contains('marin')"))
    };
};

pjs.config({
    // options: 'stdout', 'file' (set in config.logFile) or 'none'
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});

pjs.addSuite({
    url: 'http://www.mophie.com/index.html',
    moreUrls: function() {
       return _pjs.getAnchorUrls('li a');
    },
    scraper: scraper
});

2 Answers

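You don't need jQuery to read window.location.href: it is already a plain string, and passing a string to $() makes jQuery treat it as a selector rather than as a value. I'm not sure how to get at the URL that pjscrape tracks internally, but changing your code to this works: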

var scraper = function() {
    return {
        urlFound: window.location.href,
        name: $('h1').text(),
        marin: _pjs.getText($("script:contains('marin')"))
    }
};
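
To check the idea outside pjscrape, the scraper body can be run against stand-in globals. This is only a sketch: makeScraper, fakeWindow and fakeJQuery are hypothetical names invented for the illustration and are not part of pjscrape. Inside a real scrape, window and $ already exist in the page context.

```javascript
// Hypothetical wrapper (not part of pjscrape): takes the page globals as
// parameters so the scraper can be exercised without a browser.
function makeScraper(win, $) {
    return function () {
        return {
            // location.href is already a plain string, so no jQuery wrapper is needed
            urlFound: win.location.href,
            name: $('h1').text()
        };
    };
}

// Stand-ins for the browser globals, for illustration only.
var fakeWindow = { location: { href: 'http://www.mophie.com/index.html' } };
var fakeJQuery = function (selector) {
    return { text: function () { return 'Example heading for ' + selector; } };
};

var record = makeScraper(fakeWindow, fakeJQuery)();
console.log(JSON.stringify(record));
```

Because urlFound is an ordinary string, it serializes to JSON like any other field of the returned object.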
damienfrancois

Alternatively, you can use document.URL: save it in a variable and then write it to a file, as described in "How to read and write into file using JavaScript".
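
A minimal sketch of that approach, assuming a hypothetical buildItem helper and a fake document object used only so the snippet runs outside a browser. With the config from the question (writer: 'file' and outFile set), returning the value from the scraper should be enough and no separate file-writing code should be needed, though that has not been verified here.

```javascript
// Hypothetical helper: builds one output item from a document-like object.
function buildItem(doc) {
    // document.URL holds the address of the current page as a string
    var urlFound = doc.URL;
    return { urlFound: urlFound, title: doc.title };
}

// Stand-in for the browser's document, for illustration only.
var fakeDocument = { URL: 'http://www.mophie.com/index.html', title: 'Example page' };

// Serialize a list of items, roughly as a JSON writer might.
var json = JSON.stringify([buildItem(fakeDocument)]);
console.log(json);
```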

maudulus