
I am developing a web crawler to collect job ads from www.seek.com.au. When I search for something, say "ios", the request URL is "http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=ios&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=KeywordRelevance&searchFrom=quick&searchType=".

In a browser, this URL shows all the iOS-related jobs. However, the HTTP response itself contains no job information at all, so I guess the data is fetched via AJAX. Surprisingly, I could not find any such AJAX request or response after carefully going through the browser's network monitoring tool.

So my question is: how is the job data loaded, and where does it come from? If it is an AJAX request, what is the URL, and what does the response look like?

Robin Sun
  • Install and run a tool like Fiddler to log network traffic. – Richard Mar 08 '14 at 03:09
  • Go back to the page and open the "inspector tools". Click on the "network" tab; there is a small grey circle that records the activity on the webpage from your requests. Click it, then watch the information icons that show up. – les Mar 08 '14 at 03:11
  • Well, with noscript, all I see are a bunch of dots moving across the screen, so there's clearly some script that loads it. You could start by tracing through that. (Why do all websites these days require client scripting to just show a page?) – lc. Mar 08 '14 at 03:12
  • I am using Google Chrome and the "Network" console to monitor requests/responses. But I could not find anything related to the loading approach of job info. I agree that it is very likely that a script does the job and I do think websites these days rely on client scripting as a way to prevent crawlers. – Robin Sun Mar 08 '14 at 03:26

1 Answer


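The part of the URL after "#" is the "hash" (or fragment identifier). Browsers do not send it to the server with the request, so the server cannot use it to build the initial state of the page; it only returns the HTML and scripts, and a script then reads the hash and requests the matching results. That request is JSONP, which is loaded through a script element rather than XMLHttpRequest, so it is easy to miss if the network tool is filtered to XHR. That's possibly the reason why you're not able to see any AJAX request.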
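To make the "hash" part concrete, here is a minimal Python sketch that splits it off the page URL and reads it as a set of parameters. The URL is an abbreviated form of the one in the question, and `fragment_params` is a hypothetical helper name, not anything from the site.

```python
from urllib.parse import urlsplit, parse_qs

# Abbreviated form of the URL from the question (most empty parameters dropped).
page_url = (
    "http://www.seek.com.au/jobs/in-australia/"
    "#dateRange=999&workType=0&keywords=ios&page=1&nation=3000"
    "&sortMode=KeywordRelevance"
)


def fragment_params(url):
    """Return the parameters encoded in the URL's fragment as a dict."""
    fragment = urlsplit(url).fragment  # everything after the '#'
    parsed = parse_qs(fragment, keep_blank_values=True)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parsed.items()}


params = fragment_params(page_url)
print(urlsplit(page_url).path)
print(params["keywords"], params["page"])
```

The path and the fragment come out as separate pieces, and all of the search criteria live in the fragment.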

Now, if you edit the hash part of the URL, JavaScript will be able to detect that (see https://stackoverflow.com/a/680865/368544 for an example).

I tried it on the page and found a GET AJAX request to https://api.seek.com.au/v2/jobs/search?&callback=jQuery18206269974991255204_1394249755755&keywords=&hirerId=&hirerGroup=&page=1&classification=&subclassification=&graduateSearch=false&location=&nation=3000&area=&isAreaUnspecified=false&worktype=&salaryRange=0-999999&salaryType=annual&dateRange=999&sortMode=ListedDate&engineConfig=&usersessionid=bkbtlmlxcrqb4tfi5mvmck1r&_=1394249843957, which apparently produces a JSONP response. The parameters look quite similar to the ones in the hash.
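For a crawler, a JSONP response has to be unwrapped before it can be parsed as JSON. The following is a rough Python sketch under several assumptions: the endpoint and parameter names are copied from the request above and may have changed since, `build_search_url` and `strip_jsonp` are hypothetical helper names, and the sample payload is made up to show the wrapping only (it is not the real response format).

```python
import json
import re
from urllib.parse import urlencode

# Endpoint as observed in the request above; it may no longer behave this way.
API_URL = "https://api.seek.com.au/v2/jobs/search"


def build_search_url(keywords, page=1, callback="cb"):
    """Build a search URL using a subset of the observed parameters."""
    query = {
        "callback": callback,  # name of the JSONP wrapper function
        "keywords": keywords,
        "page": page,
        "nation": 3000,
        "dateRange": 999,
        "sortMode": "ListedDate",
    }
    return API_URL + "?" + urlencode(query)


def strip_jsonp(text):
    """Remove the 'callback( ... )' wrapper and parse the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Made-up payload, used only to demonstrate the unwrapping.
sample = 'cb({"example": [1, 2, 3]});'
print(build_search_url("ios"))
print(strip_jsonp(sample))
```

In a real crawler, the text passed to `strip_jsonp` would be the body fetched from the URL that `build_search_url` returns, for example with `urllib.request.urlopen`.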

Martín Schonaker