Scrape and Extract data from https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/ when it is not visible in the 'Source Code' of the webpage

Question

I am trying to write an automated PHP script to scrape and extract all 'Job Titles' (Primary Care Physician - Tidewater Market, Primary Care Physician - Richmond Market, etc.) from the URL https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/

However, this is not straightforward because the required data is not directly visible in the source code of the webpage. I also inspected 'Developer Tools->Network' in several browsers, but could not locate the source of the data.

Any help would be highly appreciated.

Thanks & Regards!

Not related, but in my view PhantomJS might handle this a lot more easily. — Chay22, Feb 05 '17 at 02:35

score 7 · Accepted Answer · edited May 23 '17 at 12:01

Looking at the requests made by the website, one notices an XHR request that contains the data you care about.

However, visiting that URL in a browser gives the same result as navigating to https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/. Looking further at the request headers, one notices Accept:application/json,application/xml (which signals that the client expects a JSON or XML document). Indeed, requesting https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/ with this additional header returns the desired data:

>>> import urllib.request
>>> req = urllib.request.Request('https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/')
>>> req.add_header('Accept', 'application/json,application/xml')
>>> urllib.request.urlopen(req).read().decode('utf-8').find('Primary Care Physician ') > 0
True

Therefore in PHP you probably want to do the following steps:

Request https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/ with the additional header Accept:application/json,application/xml (see e.g. How do I send a GET request with a header from PHP?)
Parse the returned JSON (e.g. using json_decode, see http://php.net/manual/de/function.json-decode.php)
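
For illustration, the two steps above can be sketched in Python, the language of the snippet earlier in this answer; the same flow carries over to PHP. This is only a sketch under assumptions: the layout of the returned JSON is not described here, so instead of guessing at key names the hypothetical helper find_strings walks the whole decoded document and collects every string containing a given piece of text. Both function names are made up for this example.

```python
import json
import urllib.request

URL = 'https://chenmed.wd1.myworkdayjobs.com/en-US/jencare/'


def fetch_json(url):
    """Step 1: request the page with the Accept header and decode the JSON body."""
    req = urllib.request.Request(url)
    req.add_header('Accept', 'application/json,application/xml')
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode('utf-8'))


def find_strings(node, needle):
    """Step 2: recursively collect every string value that contains `needle`.

    The real key names in the response are unknown here (assumption), so the
    search is structure-agnostic: it descends into dicts and lists alike.
    """
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_strings(value, needle))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_strings(item, needle))
    elif isinstance(node, str) and needle in node:
        found.append(node)
    return found
```

Calling find_strings(fetch_json(URL), 'Primary Care Physician') should then yield the matching strings, assuming the server still answers this header with JSON. Once the actual structure has been inspected, reading the titles by their real keys is likely to be more robust than a text search.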

Worked like a charm. Thanks @Jonathan — Sam, Feb 05 '17 at 16:45
