I want to use selenium or windmill inside google app engine in order to scrape a JS filled website. I know that windmill is written in python and javascript.

Is this possible? If it is, how do I include the library?
If not, could you explain why and provide alternatives?

Thanks.

Update

I searched a little more and saw that scrapy is pure python.
Will that work? Does it handle javascript?

Uri
  • It is resolved now with the flex functionality of GAE - https://stackoverflow.com/questions/14384062/python-headless-browser-for-gae/51427118#51427118 – Abhishek Gupta Oct 02 '19 at 22:34

3 Answers

Any python "scraping" library is unlikely to be able to interpret the javascript for you on appengine since it would probably require some kind of C-extension (like a binding to spidermonkey or v8) which would be against the GAE sandboxing.

But if you venture over to the Java side, you might have more luck. I know that you can get Rhino running on AppEngine, and with a little help from env.js you could emulate the DOM. A quick Google search shows a bunch of scraping tools for Java. It's just a matter of tying it all together.

HtmlUnit looks like it attempts to do just this, but it is unclear whether it is currently appengine-friendly, as it appears to be threaded.

Chris Farmiloe

I believe both Selenium and Windmill only allow you to control a browser, not simulate one. They expect to run in a desktop environment and drive a real browser, which you can't do with App Engine.

You can use the URL Fetch API and an HTML parser like BeautifulSoup to handle page scraping from App Engine.
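
As a rough illustration of that fetch-and-parse approach, here is a minimal sketch. It is not tied to any particular site, and it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra dependencies. `LinkExtractor`, `extract_links`, and `SAMPLE_HTML` are hypothetical names made up for this example; on App Engine the markup would come from the URL Fetch API rather than a hard-coded string.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href that appears in the markup as served."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page. On App Engine the markup would instead come
# from the URL Fetch API, for example:
#   from google.appengine.api import urlfetch
#   html = urlfetch.fetch(url).content  # decode to text before parsing
SAMPLE_HTML = """
<html><body>
  <a href="/static-link">Static</a>
  <script>document.write('<a href="/js-link">Dynamic</a>');</script>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
print(links)
```

Only the link present in the served markup is collected. The one written by the script never shows up, because nothing here executes JavaScript.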

Drew Sears
  • Yes, that works for getting static content. But you need a JavaScript interpreter and full DOM model to get the resulting dynamic page content. – Keith May 08 '11 at 22:26

Both Selenium and Windmill (which I think is now unmaintained) are controllers for a real browser. Usually they spawn a real browser (e.g. Firefox) as a subprocess and control it. I don't think you can do that in AppEngine. The closest thing to a pure-code browser that I know of is HtmlUnit, but that's Java. As far as I know there is no equivalent for Python.

Keith
  • I was under the impression that they can also be used as libraries, as shown here: http://www.packtpub.com/article/web-scraping-with-python-part-2 Is this untrue? I thought that since it's written in Python and JS, maybe GAE could run it... Are you sure it can't be done? If htmlunit is indeed the only solution, is there any way to use it with my Python code, like a wrapper or adding Java alongside the Python code? – Uri May 08 '11 at 22:55
  • For what it's worth, I recently heard there are Python bindings (via Qt) for webkit. Also useless on GAE, but it's a little less than spawning an *entire* real browser, it probably just runs the browser engine. – Steve Jessop May 08 '11 at 22:56
  • @Uri They are also using the real browser subprocess method: "...we will use Windmill as it allows the JavaScript code to execute in the web browser before getting the page content." The example shows them using Firefox. Any JS that Windmill has is run in the browser. – Keith May 09 '11 at 05:15