
I'm loading a Google search page with a preset search term ("Apples"). Then I want to type into the search box to find something else, but it doesn't behave as expected (detailed description below the code).

var links = [];
var casper = require('casper').create({
    // verbose: true, 
    // logLevel: "debug" 
    // pageSettings: {
    //  userAgent: 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5'
    // }
});

function getLinks() {
    var links = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(links, function(e) {
        return e.innerText;
    });
}

casper.start('https://www.google.com/#safe=off&q=Apples', function() {
    // search for 'casperjs' from google form
    this.fill('form[action="/search"]', { q: 'casperjs' }, true);
    casper.capture('screenshot/googleresults1.png');

});

casper.then(function() {
    // aggregate results for the 'casperjs' search
    links = this.evaluate(getLinks);
    casper.capture('screenshot/googleresults2.png');
    // now search for 'phantomjs' by filling the form again
    this.fill('form[action="/search"]', { q: 'phantomjs' }, true);

});

casper.then(function() {
    // aggregate results for the 'phantomjs' search
    links = links.concat(this.evaluate(getLinks));
});

casper.run(function() {
    // echo results in some pretty fashion
    this.echo(links.length + ' links found:');
    casper.capture('screenshot/googleresults3.png');
    this.echo(' - ' + links.join('\n - ')).exit();
});

The bugs I experienced:

  • Including the user agent in .create() gives me no results in the console.
  • Commenting out the user agent but including verbose and logLevel gives me "Apples" results.
  • Commenting out everything gives me the right results (CasperJS and PhantomJS).

My questions:

  1. I don't understand why turning on both verbose and logLevel gives me "Apples" results ("Apples" is the preset term in the casper.start URL).
  2. Why does turning on User Agent give me 0 results?

Is anyone else getting this? As you can see, the right results should be for CasperJS and PhantomJS, the two terms that the fill calls enter into the search box.

Screenshots of my 3 captures: Screenshot1, Screenshot2, Screenshot3

After running the program in my console a few times, it appears that on some occasions the first fill action does not go through, so the script scrapes the "Apples" results instead. Why does this happen? Should I use a different function instead?

Artjom B.
Ming

1 Answer


Google delivers different pages depending on the user agent, viewport size and other metrics.

The different pages can manifest themselves in additional JavaScript which does not run correctly in PhantomJS (clicking and submitting stuff is always a problem). It is also possible that elements are added, removed or their IDs changed between different configurations (user agent, viewport size).

You should take screenshots (casper.capture(filename)) and save the current page source (fs.write(filename, casper.getHTML())) to see whether there are differences compared to what you see in your desktop browser.
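A minimal sketch of that debugging step. The helper name dumpPageState and the file names are hypothetical; the calls it makes (casper.capture, casper.getHTML, and the PhantomJS fs.write(path, content, mode)) are the ones named above. The CasperJS instance and the fs module are passed in as arguments so the helper does not depend on how the script is set up.

```javascript
// Hypothetical helper (not part of CasperJS): saves a screenshot and the
// current page source under the same base name, so both can be compared
// with what a desktop browser shows.
// `casper` is a CasperJS instance; `fs` is the PhantomJS file-system module,
// obtained with require('fs') inside a CasperJS script.
function dumpPageState(casper, fs, baseName) {
    var imagePath = baseName + '.png';
    var htmlPath = baseName + '.html';
    casper.capture(imagePath);                  // rendered page as an image
    fs.write(htmlPath, casper.getHTML(), 'w');  // current DOM serialized as HTML
    return { image: imagePath, html: htmlPath };
}

// Intended use inside a CasperJS script. Call it from a step so that it runs
// after the page has loaded:
//
// var fs = require('fs');
// casper.then(function () {
//     dumpPageState(this, fs, 'after-first-fill');
// });
```

Running it once with and once without the userAgent setting should give two HTML files that can be diffed to see what Google changed between the two configurations.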


Specific issues in your script:

  • If there is no page load, then you should use one of the casper.wait* functions to wait for the changed content. casper.then() is an asynchronous step function that usually only catches full page loads.
    On that note, casper.fill() finishes immediately, but the page may take a while until the content for the typed-in query is actually loaded. Therefore, calling casper.capture() immediately after casper.fill() will not give the intended result.

  • this inside of a CasperJS step callback refers to casper, so there you can use them interchangeably. (This does not hold inside functions passed to evaluate(), which run in the page context.)
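The waiting advice in the first point could be sketched roughly as follows. The helper name searchAndCollect is hypothetical, and so is the readiness check: it assumes the page title contains the search term once the new results are shown, which may not hold for every page variant Google serves. The form selector and the h3.r a selector are taken from the question.

```javascript
// Runs in the page context (via evaluate); selector taken from the question.
function getLinks() {
    var links = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(links, function (e) {
        return e.innerText;
    });
}

// Hypothetical helper: submits a search term, then waits until the page
// reflects the new query before scraping, instead of relying on then().
function searchAndCollect(casper, term, collected) {
    casper.then(function () {
        // fill() returns immediately; the results arrive later
        this.fill('form[action="/search"]', { q: term }, true);
    });
    casper.waitFor(function check() {
        // Assumption: the title contains the term once the results are shown
        return this.getTitle().indexOf(term) !== -1;
    }, function then() {
        Array.prototype.push.apply(collected, this.evaluate(getLinks));
    }, function onTimeout() {
        this.echo('Timed out waiting for results for "' + term + '"');
    }, 10000); // wait at most 10 seconds
}

// Intended use in the script from the question:
//
// casper.start('https://www.google.com/#safe=off&q=Apples');
// searchAndCollect(casper, 'casperjs', links);
// searchAndCollect(casper, 'phantomjs', links);
// casper.run(function () { /* echo links */ });
```

Waiting only for h3.r a to exist would not be enough here, because the "Apples" results already match that selector before the new query has loaded.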

Artjom B.
    I'd recommend not starting with a Google page when starting out with scraping/automating. – Artjom B. Apr 03 '16 at 12:56
  • Thanks very much for the help. I did what you suggested and updated my question with the screenshots. It appears that the first fill call with my query CasperJS does not go through, so the program scrapes "Apples". Also, when I include my userAgent, the whole page does not load in the screenshot. Is there something wrong with it? Should I use this.getHTML() or casper.getHTML()? Thanks very much. – Ming Apr 03 '16 at 13:29
  • Thanks very much for the feedback. I also have another question on CasperJS. http://stackoverflow.com/questions/36386601/casperjs-not-returning-google-search-link-titles-but-screenshot-source-code-te Hope you can see that as well. Thanks. – Ming Apr 03 '16 at 14:29