I'm trying to scrape, using Python, a certain type of website (this one, for example) that uses AJAX requests with jQuery to load some of its content. (I'm also aware of the very good post here, but at the moment I think Selenium might be unnecessary for my problem.)

Using Firebug, I can see that when I load a menu, cookies get set in a logical way, using a numbering system to group events like:

(Sport, Country, Competition, Event) 

e.g. for all Soccer, England events the numbers are

(7, 55, 0, 0)

Then, when the JavaScript function updateCenter() is called, it builds a URL from these cookie values, like:

 var loadUrl = "/_betting/getCenterColumn/" + centerStateCookie + "/" + selectedSport
 + "&" + selectedCategory + "&" + selectedCompetition + "&" + selectedEvent + "&" + 
 selectedLiveNowEvent + "&" + expandBetNbrInActiveSettledBets;

For my example above this looks like:

/_betting/getCenterColumn/displayEventsFromCategory/7&55&0&0&0&0
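The same concatenation can be reproduced in Python. This is only a minimal sketch: the function name `build_center_url` and the default of 0 for the last two values are illustrative assumptions, not something taken from the site's `betting.js`.

```python
def build_center_url(state, sport, category, competition, event,
                     live_now_event=0, expand_bet_nbr=0):
    """Mirror the string concatenation shown in updateCenter().

    `state` stands for the centerStateCookie value; the remaining
    arguments are the numeric ids read from the cookies.
    """
    ids = (sport, category, competition, event, live_now_event, expand_bet_nbr)
    # The site joins the ids with '&' inside the path, not in a query string.
    return "/_betting/getCenterColumn/%s/%s" % (state, "&".join(str(i) for i in ids))


# Soccer (7), England (55), no specific competition or event.
print(build_center_url("displayEventsFromCategory", 7, 55, 0, 0))
# prints /_betting/getCenterColumn/displayEventsFromCategory/7&55&0&0&0&0
```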

Finally, an AJAX request is made to update the center DIV with content loaded from that URL (the initial .html(ajax_load) call just shows a spinner GIF while the request is processed):

$("#PluginBettingCenterContent").html(ajax_load).load(loadUrl);

All well and good, but the Firebug XHR requests actually show that the GET link requested wasn't quite the above but has some numbers appended:

 GET /_betting/getCenterColumn/displayEventsFromCategory/7&55&0&0&0&0?_=1392198690842

Where does this ?_=1392198690842 come from in such an AJAX request?

Since I can easily scrape and build the URL that goes into the AJAX load, I was hoping just to request these URLs directly, but I don't understand what the ?_= and the final set of numbers appended to the GET request are, or how I could compute them myself.

fpghost
  • It's an epoch timestamp, and it's appended to prevent caching: http://stackoverflow.com/questions/12225576/why-some-numbers-are-added-to-url-of-ajax-object-and-how-to-remove-them – David Fregoli Feb 12 '14 at 10:00
  • @DavidFregoli Oh, you mean so that the newest odds are always loaded rather than some old page a client may have sitting there from a previous call? I should be able to simulate this timestamp from my python code then. I can't see exactly where in the `betting.js` on the site, this timestamp gets appended to the ajax call though. Does it happen internally in `jquery`? – fpghost Feb 12 '14 at 10:03
  • Yes, it's an internal thing, see the linked URL (it can be disabled). Browsers cache servers' responses based on the URL string, so this is basically a workaround – David Fregoli Feb 12 '14 at 10:50

1 Answer

It's likely the timestamp parameter in the URL is optional.

However, if you want to mimic the browser as closely as possible, you can append the timestamp manually:

>>> import time
>>> url = 'http://example.com/index'
>>> '%s?_=%d' % (url, time.time() * 1000)
'http://example.com/index?_=1392249064418'
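If you need this for many URLs, the idea can be wrapped in a small helper. This is a sketch under assumptions: the name `with_cache_buster` and the optional `now` argument are made up for illustration, and the choice of `&` when the URL already has a query string is my reading of how such a parameter would normally be appended.

```python
import time


def with_cache_buster(url, now=None):
    """Append a `_=<milliseconds since epoch>` parameter to url.

    `now` is seconds since the epoch; it defaults to the current time
    and exists mainly so the result can be made reproducible.
    """
    if now is None:
        now = time.time()
    millis = int(now * 1000)
    # Use '?' when there is no query string yet, '&' otherwise.
    separator = "&" if "?" in url else "?"
    return "%s%s_=%d" % (url, separator, millis)


# The path from the question has no '?', so the parameter is added with '?'.
print(with_cache_buster(
    "/_betting/getCenterColumn/displayEventsFromCategory/7&55&0&0&0&0"))
```

The resulting string can then be joined to the site's host name and fetched with whatever HTTP client you already use.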
R. Max