I am writing a generic Selenium/PhantomJS-based spider to access and crawl web pages.
The input to the program is a template describing what needs to be crawled (CSS selectors); the output should be the data extracted according to that template.
If we try to crawl the images from a website, we sometimes get empty images (this happens when the page source at the time of execution does not yet include the images), which can be solved with a wait.
However, a more challenging issue occurs when the web page serves placeholders for images, which are later replaced with the real image URLs via an AJAX request.
The question is how to make sure Selenium crawls the images only once their real URLs are included in the page. I was thinking of checking the src
attribute of the images for changes and starting to parse the page source only after a single change has occurred. However, I am not sure how this can be implemented, or whether it is a good idea.
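To make the idea concrete, here is a rough sketch of what I have in mind, not tested against a real page. It assumes `driver` is any Selenium WebDriver instance (only `execute_script` is used); the helper name `wait_for_src_change`, the timeout, and the polling interval are my own placeholders.

```python
import time

# JavaScript run in the page: collect the current src attribute of every <img>.
GET_SRCS_JS = (
    "return Array.prototype.map.call(document.images, "
    "function (img) { return img.getAttribute('src'); });"
)


def wait_for_src_change(driver, initial_srcs, timeout=10.0, poll=0.25):
    """Poll until every image's src differs from its initial value.

    Returns the new list of src values, or raises TimeoutError if the
    attributes have not all changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.execute_script(GET_SRCS_JS)
        changed = (
            bool(current)
            and len(current) == len(initial_srcs)
            and all(new != old for new, old in zip(current, initial_srcs))
        )
        if changed:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("image src attributes did not change in time")
        time.sleep(poll)
```

The intended use would be to take a snapshot with `driver.execute_script(GET_SRCS_JS)` right after `driver.get(url)` and pass it in as `initial_srcs`. One weakness I can already see: if the AJAX request finishes before the snapshot is taken, the snapshot already holds the real URLs and the wait will time out, so this approach seems fragile on its own.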
EDIT
<html>
  <head>
    <style>
      img {
        width: 100%;
        height: auto;
      }
    </style>
  </head>
  <body>
    <div id='wrapper'>
      <div class='wrapper-child'>
        <img data-backup='./1clr.jpg' src='./1bw.jpg'>
      </div>
      <div class='wrapper-child'>
        <img data-backup='./2clr.jpg' src='./2bw.jpg'>
      </div>
      <div class='wrapper-child'>
        <img data-backup='./3clr.jpg' src='./3bw.jpg'>
      </div>
    </div>
    <script src='./jquery.js'></script>
    <script type='text/javascript'>
      $(document).ready(function() {
        // setTimeout(function() {
        // replace image placeholders
        $.get("ajax/test.html", function(data) {
        }).always(function() {
          $('img').each(function() {
            $(this).attr('src', $(this).attr('data-backup'));
          });
        });
        // }, 1000);
      });
    </script>
  </body>
</html>
Assuming I have this page, how can I use Selenium to crawl the images after the jQuery update?
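For this particular page, one condition I could imagine waiting on is "every img has had its data-backup value copied into src". Below is a minimal sketch of that, assuming `driver` is a Selenium WebDriver instance that has already loaded the page; the function name `crawl_images_after_swap` and the timing values are hypothetical, and the check relies on the `data-backup` attribute, which a generic spider could not count on for other sites.

```python
import time

# JavaScript run in the page: true once at least one <img data-backup> exists
# and each of them has had its data-backup value copied into src.
ALL_SWAPPED_JS = """
var imgs = document.querySelectorAll('img[data-backup]');
if (imgs.length === 0) { return false; }
for (var i = 0; i < imgs.length; i++) {
    if (imgs[i].getAttribute('src') !== imgs[i].getAttribute('data-backup')) {
        return false;
    }
}
return true;
"""

# JavaScript run in the page: the resolved (absolute) URL of every <img>.
IMAGE_URLS_JS = """
var urls = [];
for (var i = 0; i < document.images.length; i++) {
    urls.push(document.images[i].src);
}
return urls;
"""


def crawl_images_after_swap(driver, timeout=10.0, poll=0.25):
    """Wait for the placeholder swap, then return the image URLs."""
    deadline = time.monotonic() + timeout
    while not driver.execute_script(ALL_SWAPPED_JS):
        if time.monotonic() >= deadline:
            raise TimeoutError("placeholders were not replaced in time")
        time.sleep(poll)
    return driver.execute_script(IMAGE_URLS_JS)
```

The hand-written loop could presumably be replaced with `WebDriverWait(driver, timeout).until(lambda d: d.execute_script(ALL_SWAPPED_JS))` from `selenium.webdriver.support.ui`, but I would still need a condition that works when the page gives no hint such as `data-backup`.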