
I'm using CasperJS for web scraping, but I ran into a problem scraping the page described below.

The HTML of the page looks like this:

<img id="trigger">
<img id="cur_img_xxx" class="show">
<img id="cur_img_yyy" class="cache">

All <img> elements share the same dimensions, and "#trigger" is on the topmost layer. When an image has the .show class, it is displayed on the page; when it has the .cache class, it is downloaded but hidden. This way, when the user clicks on the image (which is actually the trigger), the next image is shown and a new image is downloaded via AJAX. The resulting HTML becomes:

<img id="trigger">
<img id="cur_img_xxx" class="cache">
<img id="cur_img_yyy" class="show">
<img id="cur_img_zzz" class="cache">

I guess it's a good strategy for improving the UX and for discouraging web scraping, but I still want to scrape :P

I tried $("#trigger").click() in the web console, and the images were navigated and downloaded correctly. However, when I tried to simulate this process using CasperJS, neither the navigation nor the image downloading worked. Please refer to the code:

var casper = require("casper").create({
  clientScripts: [
    'include/jquery.js'
  ],
  pageSettings: {
    // Has no effect here: this only stops inline images from loading,
    // but all images on this page are loaded dynamically.
    loadImages:  false,
    loadPlugins: false
  },
  verbose: true
});

casper.start("http://www.example.com/1234.html");

casper.then(function () {
  console.log("Connected! Current Url = " + this.getCurrentUrl());
});

casper.then(function () {
  // findInitialImgs will find imgs that have already been loaded 
  imgs = this.evaluate(findInitialImgs);

  this.waitForSelector("#image_trigger").thenClick("#image_trigger");

  var next = this.evaluate(function () {
    return $("img[id^='cur_img_']").last().attr("href");
  });

  console.log(next);
});

casper.run(function () {
  this.echo('End').exit();
});

After "#trigger" is clicked, the last entry should be different, i.e. it should change from <img id="cur_img_yyy"> to <img id="cur_img_zzz">. However, next still refers to <img id="cur_img_yyy">. Did I do anything wrong?

nevets

2 Answers


How do you validate that nothing happens? All wait*() and then*() functions are asynchronous step functions, but evaluate() is not, so it runs immediately, before the two steps scheduled by waitForSelector() and thenClick(). You need to wrap the last evaluate() call in a then block to make sure the step that contains it is executed after the click.
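To make the ordering issue concrete, here is a toy step queue in plain JavaScript. It is only a sketch of the scheduling idea and not CasperJS's actual implementation; createToyRunner and the step names are made up for illustration.

```javascript
// Toy model of a step queue (NOT CasperJS itself): then() only schedules
// a function, and run() executes the scheduled functions later, in order.
function createToyRunner() {
  var steps = [];
  var log = [];
  return {
    log: log,
    then: function (name, fn) {
      steps.push({ name: name, fn: fn });
      return this;
    },
    run: function () {
      while (steps.length > 0) {
        var step = steps.shift();
        step.fn.call(this);
      }
      return log;
    }
  };
}

var runner = createToyRunner();

runner.then("outer", function () {
  // Scheduling the click step: nothing is clicked yet.
  this.then("click", function () {
    this.log.push("click");
  });
  // A synchronous call in the same step body runs right away,
  // i.e. before the click step that was only scheduled above.
  this.log.push("evaluate (sync, too early)");
  // Wrapping the call in its own step puts it after the click.
  this.then("after", function () {
    this.log.push("evaluate (in its own step)");
  });
});

var order = runner.run();
console.log(order.join(" -> "));
```

In this model the synchronous call is logged first, which mirrors why next still showed the old image id in the question's code.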

Since image loading is probably executed asynchronously, you would need to wrap the last evaluate call in a wait block with a short wait time:

casper.then(function () {
  // findInitialImgs will find imgs that have already been loaded 
  imgs = this.evaluate(findInitialImgs);

  this.waitForSelector("#image_trigger")
    .thenClick("#image_trigger")
    .wait(1000, function(){
      var next = this.evaluate(function () {
        return $("img[id^='cur_img_']").last()[0].id;
      });
      console.log(next);
    });
});

Note that you can't pass DOM nodes out of the page context (evaluate()), so you need to use some kind of representation of that. Here I used the id of the last element.

For reference (casper.evaluate() is only a wrapper around PhantomJS' page.evaluate()):

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!
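As a rough illustration of that rule of thumb, a JSON round-trip can stand in for the boundary between the page context and the script. This is an approximation, not how PhantomJS works internally; crossBoundary and fakeNode are hypothetical names, and fakeNode is a plain object rather than a real DOM node.

```javascript
// Approximates "if it can be serialized via JSON, it is fine" with a
// JSON round-trip. Illustration only, not the PhantomJS implementation.
function crossBoundary(value) {
  var json = JSON.stringify(value);
  // JSON.stringify returns undefined for functions; avoid parsing that.
  return json === undefined ? undefined : JSON.parse(json);
}

// Hypothetical stand-in for an element; a real DOM node is not a plain
// object and cannot be returned from evaluate() at all.
var fakeNode = {
  id: "cur_img_zzz",
  className: "cache",
  click: function () {}
};

var whole = crossBoundary(fakeNode);          // the function property is lost
var idOnly = crossBoundary(fakeNode.id);      // a string survives unchanged
var fnOnly = crossBoundary(function () {});   // a bare function does not survive

console.log(typeof whole.click, idOnly, fnOnly);
```

This is why returning a string such as the element's id is the safer choice.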

Artjom B.
  • I've edited the question above to show how I validated nothing happened. I will take your advice to have a try :) – nevets Aug 13 '15 at 07:47
  • I've extended my answer with an example and a fix for a problem that I haven't noticed before. – Artjom B. Aug 13 '15 at 09:09
  • Oh, it's my fault that I didn't express this clearly. The code should be `$("img[id^='cur_img_']").last().attr("href")`. Sorry about this :`( – nevets Aug 13 '15 at 11:10

It seems to be a jQuery problem. After I removed the jQuery injection and changed $("img[id^='cur_img_']").last().attr("href") to

var imgs = document.querySelectorAll("img[id^='cur_img_']");
return imgs[imgs.length - 1].getAttribute("href");

everything works fine.

Then I found this answer very helpful: CasperJS click event having AJAX call

It confirms that the page's original scripts break when you inject jQuery into a page that already uses $ as jQuery.
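A toy model of how this kind of breakage can happen: loading a second copy of a library replaces the shared global, so anything the page attached to the first copy is no longer reachable through $. Whether this is exactly what happened on the page in question is an assumption; page, makeLibrary, and gallery are hypothetical names, and no real jQuery is involved.

```javascript
// Toy model: "page" stands in for the window object, and makeLibrary is
// a hypothetical stand-in for loading one copy of a jQuery-like library.
function makeLibrary(label) {
  return { label: label, fn: {} };
}

var page = {};

// 1. The page loads its own copy and attaches a plugin to it.
page.$ = makeLibrary("page copy");
page.$.fn.gallery = function () {
  return "next image";
};

// 2. The page's click handler looks the plugin up through the global.
function pageClickHandler() {
  return page.$.fn.gallery();
}

var before = pageClickHandler();

// 3. Injecting a second copy overwrites the global; the plugin that was
//    attached to the first copy is missing from the new one.
page.$ = makeLibrary("injected copy");

var after;
try {
  after = pageClickHandler();
} catch (e) {
  after = "broken: " + e.constructor.name;
}

console.log(before, "|", after);
```

Using plain DOM calls inside evaluate(), as in the fix above, avoids touching the page's $ at all.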

nevets