
I'm not using Selenium to automate testing, but to automate saving AJAX pages that inject content, even if they require prior authentication to access.

What I tried

tl;dr: I tried multiple tools for downloading AJAX sites and gave up because they were hard to work with or simply didn't work. WebHTTrack's GUI wouldn't start on my Ubuntu machine, and providing authentication in its interactive-terminal mode was a headache. wget didn't download any of the scripts or stylesheets included on my page (see the bottom for what I tried with wget). Then, after a promising post, I tried Crowbar, a Mozilla XULRunner-based AJAX scraper, but it simply seg-faulted on me. So...

I ended up making my own broken thing in Node.js and selenium-webdriver

My Node.js script uses the selenium-webdriver npm module (which is "officially supported by the main project") to:

  • provide login information + do necessary button-clicking & typing for authentication
  • download all JS and CSS referenced on target page
  • download the target page, with the original JS/CSS links changed to local file paths

Now when I view my test page locally, many page elements appear twice, because the target site injects HTML snippets into the page each time it loads. This is how I download my target page right now:

var fs = require('fs');
var cheerio = require('cheerio');

var $;
var getTarget = function () {
    // Return the promise so the chain waits for the source to be captured.
    return driver.getPageSource().then(function (source) {
        $ = cheerio.load(source.toString());
    });
};

var targetHtmlDest = 'test.html';
var writeTarget = function () {
    fs.writeFile(targetHtmlDest, $.html(), function (err) {
        if (err) throw err;
    });
};

driver.get(targetSite)
    .then(authenticate)
    .then(getTarget)
    .then(downloadResources)
    .then(writeTarget)
    .then(function () { driver.quit(); }); // quit only after the chain settles

The problem is that the page source I get back is the already-modified source, not the original one. Running `alert("x"); window.stop();` within `driver.executeAsyncScript()` and `driver.executeScript()` does nothing.

Meredith
  • Make an additional separate HTTP request from your scraper (bypassing Selenium) to retrieve just the page source, perhaps? – Anton Strogonoff Aug 07 '14 at 03:57
  • @AntonStrogonoff I would, if I didn't need to do login to get to my page. The point of using Selenium was to automate that authentication. – Meredith Aug 07 '14 at 18:15
  • Voting to close this question now, since [another question](http://stackoverflow.com/questions/6050805/getting-the-raw-source-from-firefox-with-javascript) also asks the same thing. Waiting for a response that actually solves the problem. `innerHTML` gets the _current_ source and not the original source. – Meredith Aug 07 '14 at 19:15
  • OK. Check out some other tools apart from Selenium, though (see http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions). You can e.g. use PhantomJS with disabled JavaScript, so that page source doesn't get modified by scripts. – Anton Strogonoff Aug 08 '14 at 08:51
  • @AntonStrogonoff thanks for the link. I'll probably end up going back to using CasperJS in a bit. – Meredith Aug 14 '14 at 00:18

1 Answer


Perhaps using curl to get the page (you can pass authentication on the command line) will get you the bare source? Otherwise, you may be able to turn off JavaScript in your test browsers to prevent JS actions from firing.
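For instance, a sketch along those lines (the credentials, cookie, and URL are placeholders to replace with your own):

```shell
# Example invocations (placeholders; not run here):
#   curl --user 'alice:secret' -o raw.html 'https://example.com/target-page'
# Or, if the site uses a session cookie, copy it out of Selenium
# (driver.manage().getCookies()) and pass it along:
#   curl --cookie 'session=abc123' -o raw.html 'https://example.com/target-page'
#
# With --user, curl sends an "Authorization: Basic <token>" header,
# where the token is just the base64-encoded credentials:
printf 'alice:secret' | base64   # prints YWxpY2U6c2VjcmV0
```

Because curl never runs JavaScript, whatever it writes to `raw.html` is the server's original markup.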

DMart