Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON containing a link key whose value is the URL of the article of interest:
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/ContentPerson/@by-slug/barack-obama-12782369
which returns JSON containing HTML.
So replacing the last part of the saymedia-content URL, barack-obama-12782369, with the corresponding value for the person of interest may well pull out what you want.
To implement:
- You'll need to use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace the Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
- Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of the link of interest (barack-obama-12782369 above).
- Use urllib2 or requests to do a saymedia-content API request, replacing barack-obama-12782369 after @by-slug/ with whatever you extracted in the previous step; i.e. do another urllib2.urlopen on this URL.
- Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
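The steps above could be sketched roughly as follows. This is only a sketch under some assumptions: it uses Python 3's urllib.request (the successor to urllib2), it assumes the search results sit under an "items" list in the Google response, and it leaves out the callback parameter so that plain JSON rather than a JSONP wrapper should come back. The function names are my own, and key and cx are whatever values you see in the request in Developer Tools.

```python
import json
from urllib.parse import quote, urlencode, urlsplit
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
# Copied from the request seen above; the slug is appended after @by-slug/
CONTENT_ENDPOINT = (
    "http://api.saymedia-content.com/:apiproxy-anon/content-sites/"
    "cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/"
    "ContentPerson/@by-slug/"
)


def build_search_url(term, key, cx, num=8):
    """Step 1: build the Google search URL with a URL-escaped search term."""
    params = {"q": term, "key": key, "cx": cx, "num": num}
    # quote_via=quote escapes spaces as %20 rather than +
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def first_people_link(search_result):
    """Step 2a: find the first biography.com/people link in the parsed JSON.

    Assumes the results are listed under "items", each with a "link" key.
    """
    for item in search_result.get("items", []):
        link = item.get("link", "")
        if "biography.com/people/" in link:
            return link
    return None


def slug_from_link(link):
    """Step 2b: take the last path segment, e.g. barack-obama-12782369."""
    return urlsplit(link).path.rstrip("/").rsplit("/", 1)[-1]


def build_content_url(slug):
    """Step 3: build the saymedia-content URL for that slug."""
    return CONTENT_ENDPOINT + quote(slug)


def fetch_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_person(term, key, cx):
    """Run both requests; step 4 (picking out the HTML) is left to the caller."""
    link = first_people_link(fetch_json(build_search_url(term, key, cx)))
    if link is None:
        return None
    return fetch_json(build_content_url(slug_from_link(link)))
```

Calling fetch_person("Bill Clinton", key, cx) should then give you the parsed JSON from the second request, from which you can pull out the HTML you are after. I haven't checked the exact structure of that second response, so inspect it before relying on any particular key.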
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.