
I am writing a generic Selenium/PhantomJS-based spider to access and crawl web pages. The input to the program is a template describing what to crawl (CSS selectors), and the output should be the data extracted according to that template. When crawling images from a website, we sometimes get empty images (this happens when the page source does not yet include the images at the time of execution), which can be solved with a wait. A more challenging issue arises when the page serves placeholders for images, which are later replaced with the real image URLs via an AJAX request.

The question is how to make sure Selenium crawls the images only once their real URLs are in the page. I was thinking of checking the src attribute of the images for changes and starting to parse the page source only after a single change has occurred. However, I am not sure how this can be implemented, or whether it is a good idea.


EDIT

<html>

<head>
    <style>
    img {
        width: 100%;
        height: auto;
    }
    </style>
</head>

<body>
    <div id='wrapper'>
        <div class='wrapper-child'>
            <img data-backup='./1clr.jpg' src='./1bw.jpg'>
        </div>
        <div class='wrapper-child'>
            <img data-backup='./2clr.jpg' src='./2bw.jpg'>
        </div>
        <div class='wrapper-child'>
            <img data-backup='./3clr.jpg' src='./3bw.jpg'>
        </div>
    </div>
    <script src='./jquery.js'></script>
    <script type='text/javascript'>
    $(document).ready(function() {
        // setTimeout(function() {
            //replace image placeholders
            $.get("ajax/test.html", function(data) {

            }).always(function() {
                $('img').each(function() {
                    $(this).attr('src', $(this).attr('data-backup'));
                });
            });
        // }, 1000);
    });
    </script>
</body>

</html>

Assume I have this page. How can I use Selenium to crawl the images after the jQuery update?
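
To make the src-comparison idea concrete, here is a rough sketch of what I have in mind, not a tested solution. It is plain Java with no Selenium dependency: `SrcChangeWait`, `allChanged`, and `waitForChange` are hypothetical names, and the supplier is assumed to read the current src values from the browser (for example via `driver.findElements(By.cssSelector("img"))` and `getAttribute("src")` on each element).

```java
import java.time.Duration;
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical helper: waits until every image src differs from its placeholder.
final class SrcChangeWait {

    // True once every current src differs from the placeholder recorded at the same index.
    static boolean allChanged(List<String> placeholders, List<String> current) {
        if (placeholders.size() != current.size()) {
            return false; // image count changed; treat the page as not settled yet
        }
        for (int i = 0; i < placeholders.size(); i++) {
            if (Objects.equals(placeholders.get(i), current.get(i))) {
                return false; // this image still shows its placeholder
            }
        }
        return true;
    }

    // Polls readSrcs until allChanged(...) holds and returns the final list.
    // Throws IllegalStateException if the timeout elapses or the thread is interrupted.
    static List<String> waitForChange(Supplier<List<String>> readSrcs,
                                      List<String> placeholders,
                                      Duration timeout,
                                      Duration poll) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            List<String> current = readSrcs.get();
            if (allChanged(placeholders, current)) {
                return current;
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("src attributes did not change within " + timeout);
            }
            try {
                Thread.sleep(poll.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for src change", e);
            }
        }
    }
}
```

The placeholder list would be captured right after the page loads. One obvious weakness is that this never returns successfully for an image whose real URL happens to equal its placeholder.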

Yerken

1 Answer


If the site is using jQuery, you can check the following to make sure that all AJAX interaction is complete:

jQuery.active == 0

See this thread for a related question: "Wait for an Ajax call to complete with Selenium 2 WebDriver".

EDIT

This code works for us:

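// Imports this snippet needs (Selenium 2.x-era API; Function is Guava's, which FluentWait.until accepts there)
import java.util.concurrent.TimeUnit;
import com.google.common.base.Function;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.FluentWait;

public static int TIME_OUT_SECONDS = 10;
public static int POLLING_MILLISECONDS = 100;

// Each script returns true while the page is still busy
public static final String JS_JQUERY_DEFINED = "return typeof jQuery != 'undefined';";
public static final String JS_JQUERY_ACTIVE = "return jQuery.active != 0;";
public static final String JS_DOC_READY = "return document.readyState != 'complete';";
public static final String JS_BLOCK = "return typeof $ != 'undefined' &&  typeof $.blockSelenium != 'undefined' && $.blockSelenium==true;";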


public static void waitForJQuery(final WebDriver driver) {
    new FluentWait<WebDriver>(driver).withTimeout(TIME_OUT_SECONDS, TimeUnit.SECONDS).pollingEvery(POLLING_MILLISECONDS, TimeUnit.MILLISECONDS).until(new Function<WebDriver, Boolean>() {

        @Override
        public Boolean apply(final WebDriver input) {
            boolean ajax = false;
            boolean jQueryDefined = executeBooleanJavascript(input, JS_JQUERY_DEFINED);


            if (jQueryDefined) {
                ajax |= executeBooleanJavascript(input, JS_JQUERY_ACTIVE);
            }

            boolean ready = executeBooleanJavascript(input, JS_DOC_READY);
            boolean block = executeBooleanJavascript(input, JS_BLOCK);

            ajax |= ready;
            ajax |= block;

            // continue once all ajax requests are processed
            return !ajax;
        }
    });

}


private static boolean executeBooleanJavascript(final WebDriver input, final String javascript) {
    return (Boolean) ((JavascriptExecutor) input).executeScript(javascript);
}
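
In case `jQuery.active` momentarily reads zero between requests (for example, if a follow-up request is started from a timer), a stricter variant could require the idle check to hold on several consecutive polls. The following is only a minimal sketch of that idea and is not part of the code above: `StableIdleWait` and `waitUntilStable` are hypothetical names, and the `idle` supplier is assumed to wrap a check such as `jQuery.active == 0` executed through `JavascriptExecutor`.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper: succeeds only after the idle check holds on consecutive polls.
final class StableIdleWait {

    // Returns true once `idle` reported true on `required` consecutive polls,
    // or false if the timeout elapses (or the thread is interrupted) first.
    static boolean waitUntilStable(BooleanSupplier idle, int required,
                                   Duration timeout, Duration poll) {
        long deadline = System.nanoTime() + timeout.toNanos();
        int streak = 0;
        while (true) {
            // Any busy reading resets the streak, so a brief dip to "idle" is not enough.
            streak = idle.getAsBoolean() ? streak + 1 : 0;
            if (streak >= required) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(poll.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a 100 ms poll and `required` set to 5, the page would have to look idle for roughly half a second before crawling starts. This trades some speed for robustness and still cannot detect requests that begin after the wait returns.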
narko
  • Thanks for the suggestion. I was looking into trying that option but am not really sure about it. Assume there are chained ajax requests: in that case, might `jQuery.active` drop to zero even though additional requests are still to come? Furthermore, is it really usable for checking `GET` image requests? – Yerken Jan 24 '16 at 11:56
  • I believe that if the interaction is done with Ajax then that code can help you. It really depends on the site you are trying to scrape... Regarding your question, jQuery.active won't be zero if there is an active ajax call as far as I know. – narko Jan 24 '16 at 13:16
  • Hmm, according to my test it really makes no difference if we add a waiter for `$.active == 0`. Selenium waits for all ajax to finish by default. – Yerken Jan 24 '16 at 15:18
  • Are you sure about that? Did you see that in the Selenium documentation? As far as I know Selenium has implicit waits with a default timeout that is configurable, but I was not aware of that behaviour regarding ajax interaction. A link would be awesome! – narko Jan 24 '16 at 17:09