0

I would like to scrape the data from this website ( http://www.oddsportal.com/matches/soccer ) to get a plain text file with the match info and the odds info in this format:

00:30   Criciuma - Atletico-PR                    1:2   2.70    3.24    2.41    
10:45   Vier-und Marschlande - Concordia Hamburg  0:0   4.00    3.53    1.68    
10:45   Germania Schnelsen - ASV Bergedorf 85     2:3   1.95    3.37    3.23    
10:45   Barmbecker SG - Altona                    0:2   3.67    3.37    1.82

I used to do this with w3m, but it seems the site now renders the page with JavaScript instead of static HTML, so w3m no longer works. All the data is contained in a single div. This is one entry:

<tr xeid="862487"><td class="table-time datet t1333724400-1-1-0-0 ">17:00</td><td class="name table-participant" colspan="2"><a href="/soccer/italy/serie-b-2011-2012/brescia-marmi-lanza-verona-862487/">Brescia - Verona</a></td><td class="odds-nowrp" xoid="40456791" xodd="xzc0fxzxa">-</td><td class="odds-nowrp" xoid="40456793" xodd="cz0ofxz9c">-</td><td class="odds-nowrp" xoid="40456792" xodd="cz9xfcztx">-</td><td class="center info-value">17</td></tr>

What can I do?

emanuele

2 Answers

3

The easiest way (though maybe not the best) is to use selenium/watir. In Ruby I would do:

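require 'watir-webdriver'
require 'csv'

# Drive a real browser so the page's JavaScript runs before scraping.
browser = Watir::Browser.new
browser.goto 'http://www.oddsportal.com/matches/soccer/'

# Write one CSV row per matching table row, one column per cell.
CSV.open('out.csv', 'w') do |out|
  browser.trs(:class => /deactivate/).each do |tr|
    out << tr.tds.map(&:text)
  end
end

browser.close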
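
If you want the fixed-width plain text layout from the question rather than CSV, something like the following might work. This is only a sketch: `format_row` is a hypothetical helper, the column widths are read off the sample lines in the question, and it assumes each row's cells arrive in the order time, match, score, then the three odds (check that against the actual page).

```ruby
# Hypothetical helper: lay out one row of cell texts as a fixed-width line.
# Assumes the cells arrive as: time, match, score, odds 1, odds X, odds 2.
# Any extra trailing cells (such as a bookmaker count) are ignored.
def format_row(cells)
  time, match, score, *odds = cells.map { |c| c.to_s.strip }
  line = format('%-8s%-42s%-6s', time, match, score)
  line << odds.first(3).map { |o| format('%-8s', o) }.join
  line.rstrip # drop the padding after the last column
end

puts format_row(['00:30', 'Criciuma - Atletico-PR', '1:2', '2.70', '3.24', '2.41'])
```

Inside the loop you would then write `format_row(tr.tds.map(&:text))` to a file opened with `File.open` instead of appending to the CSV. Match names longer than 42 characters are not truncated, so they push the later columns to the right.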
pguardiario
2

If they are using JavaScript to fetch data from a service and render it into the div, w3m will not show the populated div, because it does not support JavaScript.

You have two choices:

  • Reverse-engineer their Javascript to find out where the data is coming from, and see if you can query that data source directly to get the XML or JSON they're using to update the DIV. Then you can skip the scraping entirely. They might not want you doing that, however, and may have secured the data source to prevent it. Or they might not have.

  • Use a browser which executes Javascript before you start your scraping. This way you'll have the div populated with the data. W3M-js might do this for you, or you might want to try something else (lynx or links). This question seems to be related.

ETA: Maybe PhantomJS would help here?
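
Whichever JavaScript-capable browser ends up producing the rendered HTML, pulling the cells out of rows like the one in the question could look roughly like this. It is a minimal sketch, not a robust parser: `rows_from_html` is a hypothetical helper, it uses regular expressions only to stay dependency-free, and it assumes the simple `<tr>`/`<td>` structure shown in the question. A real HTML parser would cope better with messy markup.

```ruby
# Hypothetical helper: extract the text of every <td> in each <tr> of
# already-rendered HTML. Regex-based on purpose to avoid dependencies;
# HTML entities are not decoded and nested tables are not handled.
def rows_from_html(html)
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}m).flatten.map do |row|
    row.scan(%r{<td\b[^>]*>(.*?)</td>}m).flatten.map do |cell|
      cell.gsub(/<[^>]+>/, '').strip # drop nested tags such as <a>
    end
  end
end

# The entry from the question, split over several lines for readability.
SAMPLE_ROW = <<~HTML
  <tr xeid="862487"><td class="table-time datet t1333724400-1-1-0-0 ">17:00</td>
  <td class="name table-participant" colspan="2"><a href="/soccer/italy/serie-b-2011-2012/brescia-marmi-lanza-verona-862487/">Brescia - Verona</a></td>
  <td class="odds-nowrp" xoid="40456791" xodd="xzc0fxzxa">-</td>
  <td class="odds-nowrp" xoid="40456793" xodd="cz0ofxz9c">-</td>
  <td class="odds-nowrp" xoid="40456792" xodd="cz9xfcztx">-</td>
  <td class="center info-value">17</td></tr>
HTML

p rows_from_html(SAMPLE_ROW)
```

Each inner array holds one row's cell texts in page order, so it can be joined with tabs or padded into columns before being written to the text file.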

pjmorse
  • I don't know how to get data from their service. What do you mean by "use a browser which executes Javascript before you start your scraping"? I need to do this automatically, to collect data at different times. – emanuele Apr 06 '12 at 14:19
  • If you look at the source JS which is building the content in their div, it might indicate where it's getting the data. You could get the same data (in XML or JSON) and skip the scraping if they haven't secured it. As far as the browser goes: because they're using JS to render the data, they're counting on their viewers having JS enabled. W3M does not support JS, so it's not rendering the data. I'll update my answer accordingly. – pjmorse Apr 06 '12 at 14:23
  • w3m-js seems to have disappeared from the web :( – emanuele Apr 06 '12 at 14:51
  • I agree with what you say except for the part about securing the data. If you can see the data in a browser, then you can scrape it. – pguardiario Apr 06 '12 at 19:09
  • Maybe. I can imagine the service being set up to require certain criteria (e.g. a cookie or similar session token) in the request; such criteria could certainly be imitated or spoofed somehow, but it would make regularly sipping data from the service somewhat less simple. – pjmorse Apr 06 '12 at 19:19
  • Worked for me with PhantomJS. I used this as a starting point: [link](https://nicolas.perriault.net/code/2011/scrape-and-test-any-webpage-using-phantomjs/) – vicch Aug 30 '14 at 11:00