2

I am working on a Scrapy app to scrape some data from a web page.

But some of the data is loaded by AJAX, so Python alone cannot execute that JavaScript to get the data.

Is there any library that simulates the behavior of a browser?

enzo
  • 29
  • 1
  • 1
  • 2

5 Answers

6

For that you'd have to use a full-blown JavaScript engine (like Google's V8 in Chrome) to get the real functionality of the browser and the way it interacts with the page. However, you could possibly get some information by looking up all the URLs in the page source and making a request to each, hoping for some valid data. But overall, you're stuck without a full JavaScript engine.
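
As a rough illustration of that "request every URL you can find" idea (the page URL and the regex below are placeholders, and this is no substitute for a real HTML parser):

import re
import urllib2

# Fetch the raw page source, which still contains the script-referenced URLs.
source = urllib2.urlopen("http://example.com/page").read()

# Naively collect anything that looks like an absolute URL.
candidates = set(re.findall(r'https?://[^\s"\'<>]+', source))

for url in candidates:
    try:
        body = urllib2.urlopen(url).read()
        print("%s -> %d bytes" % (url, len(body)))
    except Exception as exc:
        print("%s failed: %s" % (url, exc))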

Something like python-spidermonkey, a wrapper around Mozilla's JavaScript engine, could work. Using it might be rather complicated, though; that depends on your specific application.
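
For example, a minimal sketch based on python-spidermonkey's documented Runtime/Context API (treat it as an illustration rather than a recipe):

import spidermonkey

rt = spidermonkey.Runtime()
cx = rt.new_context()

# execute() returns the value of the last evaluated expression.
result = cx.execute("(function(a, b) { return a + b; })(13, 29);")
print(result)  # 42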

You'd basically have to build a browser, but it seems the Python community has made that fairly simple. With PyWebkitGtk you'd get the DOM, and using either python-spidermonkey mentioned above or PyV8 mentioned by Duncan, you'd theoretically get the full functionality needed for a browser/web scraper.
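
Here's a minimal sketch of the PyWebkitGtk route (GTK 2-era API; the URL is a placeholder and the title-copying trick is a commonly used workaround, so adapt it to your setup):

import gtk
import webkit

def on_load_finished(view, frame):
    # There is no direct "give me the rendered HTML" call, so copy the
    # JavaScript-built DOM into the frame title and read it from Python.
    view.execute_script(
        "document.title = document.documentElement.innerHTML;")
    html = frame.get_title()
    print(html[:200])
    gtk.main_quit()

view = webkit.WebView()
view.connect("load-finished", on_load_finished)
view.open("http://example.com/ajax-page")  # placeholder URL
gtk.main()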

zatatatata
  • 4,761
  • 1
  • 20
  • 41
3

The problem is that you don't just have to be able to execute some JavaScript (that's easy); you also have to emulate the browser DOM, and that's a lot of work.

If you want to be able to run JavaScript, then you can use PyV8. Install it with easy_install PyV8 and then you can execute any standalone JavaScript:

>>> import PyV8
>>> ctxt = PyV8.JSContext()
>>> ctxt.enter()
>>> ctxt.eval("(function(a,b) { return [a+b, a*b, a/b, a-b] })(13,29)")
<_PyV8.JSArray object at 0x01F26A30>
>>> list(_)
[42, 377, 0.4482758620689655, -16]

You can also pass in classes defined in Python, so in principle you might be able to emulate enough of the DOM for your purposes.
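
For instance, here's a rough sketch (not tested against a real page) of exposing Python objects as a fake document to the JavaScript side, using the same JSContext API as above:

import PyV8

class FakeElement(PyV8.JSClass):
    def __init__(self, element_id):
        # Attributes of a JSClass instance show up as JavaScript properties.
        self.innerHTML = "stub text for " + element_id

class FakeDocument(PyV8.JSClass):
    def getElementById(self, element_id):
        return FakeElement(element_id)

class Global(PyV8.JSClass):
    def __init__(self):
        self.document = FakeDocument()

ctxt = PyV8.JSContext(Global())
ctxt.enter()
print(ctxt.eval("document.getElementById('price').innerHTML"))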

Duncan
  • 92,073
  • 11
  • 122
  • 156
2

The simplest way to get this done is by using PyExecJS: https://pypi.python.org/pypi/PyExecJS

PyExecJS is a port of ExecJS from Ruby. PyExecJS automatically picks the best runtime available to evaluate your JavaScript program, then returns the result to you as a Python object.

I use a MacBook and installed Node.js, so PyExecJS can use the Node.js JavaScript runtime.

pip install PyExecJS

Test code:

import execjs

execjs.eval("'red yellow blue'.split(' ')")
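
If you need more than a one-liner, PyExecJS also lets you compile a script and call into it (the add function below is just an example):

import execjs

# Show which runtime PyExecJS picked (e.g. Node.js if it is installed).
print(execjs.get().name)

ctx = execjs.compile("""
    function add(x, y) {
        return x + y;
    }
""")
print(ctx.call("add", 1, 2))  # 3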

Good luck!

Bob Yang
  • 123
  • 5
2

An AJAX request is a normal web request that is simply executed asynchronously. All you need is the URL to which the JavaScript code sends the request; use that URL with urllib to get at the same data.
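
As a hedged sketch of that approach (the endpoint URL below is a made-up placeholder; find the real one in the browser's network tab, as mentioned in the comments):

import json
import urllib2

# Request the same endpoint the page's JavaScript would call.
request = urllib2.Request(
    "http://example.com/ajax/data?item=42",
    headers={"X-Requested-With": "XMLHttpRequest"},  # some servers check this
)
response = urllib2.urlopen(request).read()
print(json.loads(response))  # assumes the endpoint returns JSON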

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • That begs the question of how one should /get/ that URL! – Arafangion Jul 26 '11 at 07:39
  • There's a huge problem with this solution, specifically there's a very small chance that you'll actually get any information from the server, because in most cases, requests are built based on user interaction, variables etc that all need to be present and processed by Javascript scripts. – zatatatata Jul 26 '11 at 07:39
  • @Arafangion: Look at the network tab in Firebug or a similar tool. – Aaron Digulla Jul 26 '11 at 07:42
  • @Zanfa: Those are all part of the request. You only get into trouble if something must exist in the user session on the server. The solution for that is to record all the requests sent from the browser to the server and repeat them in your Python script. – Aaron Digulla Jul 26 '11 at 07:43
  • That's not the needed implementation he described. You'd have to manually scout every website you wish to scrape and even then you can't be sure that the same URLs work the next time, because, like it or not, a lot (and by a lot I mean most major) websites have session specific variables in the URL anyway, so this solution would work only on specific sites, requiring lots of manual work also. Even a mere timestamp would ruin your requests. – zatatatata Jul 26 '11 at 07:50
  • @Zanfa: The only generic solution would include a web browser and a human. Most of the time, my approach works because web developers try to keep their servers simple. And OP wanted to do this for a single web page. – Aaron Digulla Jul 26 '11 at 14:23
  • That's BS solution, not simple. In your case you should have just asked him to use Chrome/FF etc and copy-paste the required data (aka human and a browser), no need to write an automated script that does exactly what you did manually a minute ago and won't probably be usable in 10 minutes. – zatatatata Jul 26 '11 at 14:26
1

A little update for 2020

I reviewed the results recommended by Google Search. It turns out the best choice is #4, while #1 and #2 are deprecated!

Deprecated

  • #1 https://github.com/sony/v8eval has received very little maintenance for 2 years, takes 20 minutes to build, and the build fails... (I filed a bug report for that: https://github.com/sony/v8eval/issues/34)
  • #2 https://github.com/doloopwhile/PyExecJS has been discontinued by its owner


Frederic Bazin
  • 1,530
  • 12
  • 27