
My site, http://whatgoeswiththis.co, has a scraper that takes images from the web and posts them to our site. I can get server-rendered images with no problem, but on sites like https://www.everlane.com/collections/mens-luxury-tees/products/mens-crew-antique, the images are rendered client-side with JavaScript.

I've succeeded in writing a script on my local machine that uses ghost.py to scrape the images from this site.

However, I had to install various programs on my laptop, such as Qt, PySide, PyQt4, and XQuartz. To my knowledge, these aren't libraries I can simply add to my app. My question is: can this stack be added to my existing Django app so that users can scrape these JavaScript-injected images? Or is there another solution that web apps typically use?

Sites like http://wanelo.com are able to scrape these images. Is there something in particular they're using that would be an optimal solution?

Thanks for your help, and I apologize if I sound inexperienced (I am, but I'm learning!).

YPCrumble
  • why don't you use javascript frameworks for this, like jquery? – László Papp Sep 28 '13 at 04:37
  • The images I'm looking to scrape aren't part of the DOM from the get-go. They're rendered client-side. Rather than sending the HTML, the server is sending a js file that the browser reads to create the img html, so unless you run a headless browser to read that output, you won't see tags. – YPCrumble Sep 28 '13 at 17:37

2 Answers


My current answer is: ghost.py may work, but only after installing a lot of prerequisites that I found difficult to install and configure. My solution was to follow Pyklar's advice and use PhantomJS through the selenium library, as described here: https://stackoverflow.com/a/15699761/2532070.

I was able to switch from BeautifulSoup to Selenium/PhantomJS simply by changing a few lines of code and running brew install phantomjs and pip install selenium.
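For anyone wondering what such a switch can look like, here is a minimal sketch rather than the exact code from my scraper. It assumes an older Selenium release that still ships webdriver.PhantomJS (it was removed in Selenium 4) and a phantomjs binary on the PATH. The helper names render_page and extract_image_urls are hypothetical, and the image extraction uses the standard library's HTMLParser in place of BeautifulSoup so that the example stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that have no usable src
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in already-rendered HTML."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


def render_page(url):
    """Load url in PhantomJS and return the DOM after JavaScript has run.

    Assumption: an older Selenium release that still provides
    webdriver.PhantomJS, plus a phantomjs binary on the PATH.
    """
    from selenium import webdriver  # imported lazily; only needed here

    driver = webdriver.PhantomJS()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always shut the headless browser down
```

Usage would be along the lines of extract_image_urls(render_page(url), url). If the images are injected some time after the initial page load, an explicit wait before reading page_source may be needed.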

I hope this helps someone avoid the same struggle!

YPCrumble
  • My Windows machine simply cannot get PhantomJS working. The developer apparently is no longer updating the Python bindings for Selenium, or something along those lines. I'd recommend avoiding PhantomJS for Python. https://github.com/detro/ghostdriver/issues/236 – Alkanshel Nov 17 '13 at 02:37

You can do something like:

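from ghost import Ghost

g = Ghost()
# Start loading without blocking, then wait until the image element exists.
g.open(url, wait=False)
page, resources = g.wait_for_selector(your_image_css_selector)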
jeanphix