Scrapy with JavaScript content on a web server

Question

I'd like to scrape content from a site that apparently uses JavaScript to generate its tables (the site is oddsportal.com).

I see that Scrapy can't load dynamic content. I read that Selenium could handle it, but I'm planning to run this on a web server.

Is there a way I can parse this site, or capture the dynamic request and parse it using Scrapy?

For example, I'd like to import the full table from this page, with the headers, match names and odds:

http://www.oddsportal.com/matches/handball/

score 0 · Accepted Answer · edited May 23 '17 at 12:04

From what I understand, your constraint is that you don't have a real display. You can still go with Selenium: there is the headless PhantomJS browser, which can be automated; you can run a regular browser in a virtual display; or you can use a remote Selenium server or docker-selenium.
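As a rough illustration of the remote-server option, here is a minimal sketch. It assumes a Selenium server (for example one started from docker-selenium) is listening at `http://localhost:4444/wd/hub` and that the `selenium` Python package is installed; the URL and the function names `make_remote_driver` and `render_page` are made up for this example.

```python
def make_remote_driver(server_url="http://localhost:4444/wd/hub"):
    """Connect to a remote Selenium server, so no display is needed locally.

    Assumption: server_url points at a running Selenium server.
    """
    # Imported lazily so render_page() below has no hard dependency on selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Remote(command_executor=server_url, options=options)


def render_page(driver, url):
    """Load url in the given browser and return its HTML, then close the browser.

    Content that arrives via late AJAX calls may need an explicit wait
    before page_source is read; that is left out of this sketch.
    """
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

The returned string can then be handed to Scrapy's selectors, e.g. `scrapy.Selector(text=html)`, and parsed like any static page.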

There are multiple examples of how to combine Selenium and Scrapy.

Also, check whether the scrapy-splash middleware would be enough for your use case.
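To give an idea of what Splash does underneath the middleware: it is an HTTP service, and its `render.html` endpoint returns a page's HTML after JavaScript has run. The sketch below only builds such a URL; it assumes a Splash instance is running on `localhost:8050` (Splash's default port), and the helper name `splash_render_url` and the 2-second wait are arbitrary choices for this example.

```python
from urllib.parse import urlencode

# Assumption: a Splash instance is reachable at this address.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, splash_base=SPLASH_BASE):
    """Build a Splash render.html URL for page_url.

    `wait` is how many seconds Splash lets the page's scripts run
    before it returns the rendered HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url("http://www.oddsportal.com/matches/handball/"))
```

An ordinary Scrapy request pointed at that URL would receive the rendered page, which is roughly what scrapy-splash's `SplashRequest` arranges for you.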

answered Jan 28 '16 at 18:01

alecxe

So the easiest choice would be Scrapy+PhantomJS+Selenium? – GGA Jan 28 '16 at 18:07
@GGA yup, though I would start by trying out scrapy-splash, and then PhantomJS. – alecxe Jan 28 '16 at 18:07
Thanks, I'll try. Would scrapy-splash alone be enough for a simple one-page request? – GGA Jan 28 '16 at 18:09
@GGA basically, it would pass the page through a standalone JS engine. Sometimes that's enough to handle dynamic page parsing. `PhantomJS`, though, is the most straightforward approach here, involving less setup. – alecxe Jan 28 '16 at 18:11
How would I use the shell and Docker for scrapy-splash on a server? – GGA Jan 28 '16 at 23:31
@GGA glad to see you are actually trying out Splash. Please post that as a separate question with all the necessary details. Thanks. – alecxe Jan 28 '16 at 23:54

score 0 · Answer 2 · edited May 23 '17 at 12:15

For sites with dynamic content loaded through AJAX and JavaScript, I have used PhantomJS. It doesn't require opening a browser window because it is itself a fully scriptable headless web browser. PhantomJS is fast and includes native support for various web standards, such as DOM handling, CSS selectors, JSON and Canvas.

If you aren't a JavaScript ninja, you should look at CasperJS, which is built on top of PhantomJS. It eases the process of defining a full navigation scenario and provides useful high-level functions.

Here is an example of how CasperJS works:

CasperJs and Jquery with chained Selects
