
Is there any way to scrape beyond the first page of this:

https://www.sportstats.ca/display-results.xhtml?raceid=23666

I've tried Selenium in the past with varying degrees of success. I find it very heavy: sometimes it doesn't work, and sometimes it hangs. If at all possible, I would prefer to avoid it and just use urllib.request, doing something with headers/cookies to get the data I'm looking for.

These are the roadblocks:

1) The URL doesn't change when you go to another page.

2) The link to go to the next page (for example) is JS or something, and is not easy to handle:

<li><a id="mainForm:j_idt341" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:&quot;mainForm:j_idt341&quot;,p:&quot;mainForm&quot;,u:&quot;mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog&quot;,onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>

Can anyone point me in the right direction to walk through this and scrape each page?

user3449833
  • Why have you deleted the question? I was researching it for like 20 minutes and finally got a solution. Undelete it please: http://stackoverflow.com/questions/33427870/python-selenium-scrape-hidden-data. – alecxe Oct 30 '15 at 03:21

1 Answer


I think you can do it with Selenium without much trouble. The IDs of the page buttons follow the pattern "mainForm:j_idt336:0:j_idt338", incrementing from one button to the next. You can locate the buttons in Selenium by ID, and handle the ">" (next) button separately, also by its ID, to move forward. The IDs seem to be generated, so they may change; you could have your Selenium script take the ID format as a parameter and write a second script just for discovering that format. Also take a look at mechanize.
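A rough sketch of that idea, not tested against the live site. The ID template and the "mainForm:result_table" ID are copied from the markup above; the helper names `page_button_id` and `scrape_all_pages` are made up for illustration. It also assumes that the AJAX update replaces the table element in the DOM, so waiting for the old element to go stale should indicate that the new page has arrived.

```python
def page_button_id(index, template="mainForm:j_idt336:{index}:j_idt338"):
    """Build a page-button ID from a template (hypothetical helper).

    The default template is the pattern seen on the page; the generated
    IDs may change, so it is passed in as a parameter.
    """
    return template.format(index=index)


def scrape_all_pages(driver, next_id, table_id="mainForm:result_table",
                     max_pages=1000, timeout=15):
    """Collect the results table HTML from each page (hypothetical helper).

    Assumes clicking the "next" link replaces the table element in the DOM.
    """
    # Imported here so page_button_id works without Selenium installed.
    from selenium.common.exceptions import (NoSuchElementException,
                                            TimeoutException)
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    pages = []
    for _ in range(max_pages):
        table = driver.find_element(By.ID, table_id)
        pages.append(table.get_attribute("outerHTML"))
        try:
            driver.find_element(By.ID, next_id).click()
            # Block until the old table is detached, i.e. the update is done.
            WebDriverWait(driver, timeout).until(EC.staleness_of(table))
        except (NoSuchElementException, TimeoutException):
            # No "next" link, or the table never changed: treat as last page.
            break
    return pages
```

You would call it with something like `scrape_all_pages(driver, next_id="mainForm:j_idt341")` after `driver.get(...)` on the results URL, using whatever ID the ">" link has at the time. Waiting on the old table element, rather than sleeping for a fixed time, is meant to avoid reading the page source before the update has finished.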

gplayer
  • With Selenium, even after I "click" to the next page, the underlying source code often (not consistently) comes back with the first page again. And it seems no combination of waiting/refreshing/clicking next again can get it "unstuck" and have it move forward. – user3449833 Oct 19 '15 at 18:00
  • It seems to me that the issue appears because the page is not yet loaded when you get the new contents. Try to apply some of the hints from here: http://stackoverflow.com/questions/10720325/selenium-webdriver-wait-for-complex-page-with-javascriptjs-to-load. – gplayer Oct 20 '15 at 06:48