Scraping Biography.com using urllib2

Question

So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL: http://www.biography.com/search/ I get a blank page with no data in it.

When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.

I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.

I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.

I believe biography.com uses AJAX to search, so you need to get the JS code of the page to see what requests it's making, and make those yourself. — wavemode, Apr 18 '14 at 22:09
I just know basic JS and with what I know, I am unable to decipher how to make those JS calls. I'll appreciate any inputs. — aa8y, Apr 18 '14 at 22:11
you could use [selenium webdriver](http://selenium.googlecode.com/svn/trunk/docs/api/py/index.html) or [ghost.py](http://jeanphix.me/Ghost.py/) to get pages that are generated using Javascript. — jfs, Apr 19 '14 at 20:00

score 5 · Accepted Answer · edited May 23 '17 at 10:27

Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.

https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0

The search term I used is in the q= part of the query string: q=Barack%20Obama.
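As a rough sketch of how that query string could be assembled for any search term: the question uses urllib2 (Python 2), and in Python 3 the same functionality lives in urllib.request and urllib.parse, which is what this example assumes. The helper name build_search_url is hypothetical, and the key and cx values are taken as arguments rather than hard-coded.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, key, cx, num=8):
    """Build the search URL with a percent-encoded query string.

    quote_via=quote encodes spaces as %20 (as in the URL above)
    instead of urlencode's default '+'.
    """
    params = {"q": query, "key": key, "cx": cx, "num": num}
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


# The key here is a placeholder; the cx value is the one seen in the request above.
url = build_search_url(
    "Bill Clinton",
    key="YOUR_API_KEY",
    cx="011223861749738482324:ijiqp2ioyxw",
)
print(url)
```

The callback parameter is left out on purpose; as far as I know it only wraps the JSON in a JavaScript function call, which a script does not need.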

The search call returns JSON containing a key, link, whose value is the URL of the article of interest:

"link": "http://www.biography.com/people/barack-obama-12782369"
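A minimal sketch of pulling that link out of the response text. Two assumptions here are not stated above: that the results sit in a top-level "items" list (which is how I understand the Custom Search JSON format), and that a response requested with callback=... arrives wrapped as callback({...}); and has to be unwrapped first. The function names are hypothetical.

```python
import json


def strip_jsonp(text):
    """Return the JSON payload of a JSONP response, or the text unchanged."""
    text = text.strip()
    if text[:1] in ("{", "["):
        return text  # already plain JSON
    start, end = text.find("("), text.rfind(")")
    if start == -1 or end < start:
        return text  # no wrapper found; let json.loads report any error
    return text[start + 1:end]


def first_people_link(response_text):
    """Return the first result link containing /people/, or None."""
    data = json.loads(strip_jsonp(response_text))
    # "items" is an assumed key name; adjust it to the real response.
    for item in data.get("items", []):
        link = item.get("link", "")
        if "/people/" in link:
            return link
    return None
```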

Visiting that page shows me that this is generated by a request to:

http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/ContentPerson/@by-slug/barack-obama-12782369

which returns JSON containing HTML.

So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
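That substitution could look roughly like this in Python 3; the helper names are made up for illustration, and the base URL is simply the saymedia-content request above with the slug removed.

```python
from urllib.parse import urlparse

# Everything up to and including @by-slug/ from the request shown above.
CONTENT_API = (
    "http://api.saymedia-content.com/:apiproxy-anon/content-sites/"
    "cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/"
    "ContentPerson/@by-slug/"
)


def slug_from_link(link):
    """Return the last path segment of a biography.com/people/... link."""
    return urlparse(link).path.rstrip("/").rsplit("/", 1)[-1]


def content_url(link):
    """Build the content API URL for the person behind a /people/ link."""
    return CONTENT_API + slug_from_link(link)


print(content_url("http://www.biography.com/people/barack-obama-12782369"))
```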

To implement:

1. You'll need to use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace the Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of this link of interest (as barack-obama-12782369 above).
3. Use urllib2 or requests to do a saymedia-content API request, replacing barack-obama-12782369 after @by-slug/ with whatever you extract in step 2; i.e. do another urllib2.urlopen on this URL.
4. Parse the JSON from the response of this second request to extract the content you want.

(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
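The steps above might be wired together along these lines. This is a sketch under assumptions, not a tested scraper: it uses Python 3's urllib.request in place of urllib2, assumes the search results live under an "items" key, drops the callback parameter so the search should return plain JSON, and returns the second response unparsed beyond json.loads because the layout of the content JSON is not described here. fetch_json and fetch_biography are hypothetical names; the fetch argument exists only so the flow can be exercised without network access.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

SEARCH_URL = (
    "https://www.googleapis.com/customsearch/v1"
    "?q={q}&key={key}&cx={cx}&num=8"
)
CONTENT_URL = (
    "http://api.saymedia-content.com/:apiproxy-anon/content-sites/"
    "cs01a33b78d5c5860e/content-customs/@published/@by-custom-type/"
    "ContentPerson/@by-slug/{slug}"
)


def fetch_json(url):
    """GET a URL and decode the body as JSON."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_biography(name, key, cx, fetch=fetch_json):
    """Search for a person, then request that person's content JSON."""
    # Step 1: search with a URL-escaped name.
    search = fetch(SEARCH_URL.format(
        q=quote(name), key=quote(key, safe=""), cx=quote(cx, safe="")))
    # Step 2: keep only /people/ links and take the slug of the first one.
    links = [item["link"] for item in search.get("items", [])
             if "/people/" in item.get("link", "")]
    if not links:
        return None
    slug = links[0].rstrip("/").rsplit("/", 1)[-1]
    # Steps 3 and 4: request the content JSON; extracting the HTML is left open.
    return fetch(CONTENT_URL.format(slug=slug))
```

Calling fetch_biography("Bill Clinton", key, cx) with the key and cx values seen in the search request would perform both requests, subject to the caveat above.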

Alternatively, you can use Selenium to visit the website, do the search and then extract the content.

[I've implemented your suggestions. It seems to work](https://gist.github.com/zed/07b4b2f5b13507ac33af) — jfs, Apr 19 '14 at 20:48
Nice work on coding it up. (I'd left it as an exercise for the reader.) — Steven Maude, Apr 20 '14 at 14:56

score 0 · Answer 2 · answered Apr 18 '14 at 22:25

You will most likely need to manually copy and paste, as biography.com is a completely javascript-based site, so it can't be scraped with traditional methods.

answered Apr 18 '14 at 22:25 by wavemode

There are two solutions to this. Either use a traditional scraper on the AJAX operation itself, or use a headless browser that can run JavaScript. – halfer Apr 18 '14 at 22:37
@halfer, could you please elaborate. If you could just let me know the first few steps, I'll try to code it up. – aa8y Apr 18 '14 at 22:38
@Arun: using a live AJAX viewer, I can't see what AJAX operation is used to retrieve the text data; perhaps this is by design? I am wondering if it retrieves the data via sockets. So, it looks like using a traditional dumb scraper won't work. However, if you know some JavaScript, I suspect you could do this with PhantomJS - it runs JavaScript in the context of a WebKit browsing session on your computer. – halfer Apr 18 '14 at 23:04
Can you tell me where can I find a live AJAX viewer? I googled the term but did not get any reasonable hits. – aa8y Apr 18 '14 at 23:21
Check this answer: http://stackoverflow.com/questions/23058939/retrieving-scripted-page-urls-via-web-scrape/23059771#23059771 – dilbert Apr 19 '14 at 00:04
-1; though they can be tricky, sites loading content with JavaScript can, in some cases, be scraped with simple requests. Even if this fails, you can resort to web drivers or headless browsers to avoid the need for copy-pasting. – Steven Maude Apr 19 '14 at 13:59

score -1 · Answer 3 · answered Feb 18 '16 at 16:49

You can discover an API URL with HttpFox (a Firefox add-on). For example, http://www.biography.com/.api/item/search?config=published&query=marx returns JSON that you can process, searching for /people/ to retrieve biography links. Alternatively, you can use a screen crawler such as Selenium.
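Since the shape of that JSON is not described here, one cautious way to search it for /people/ is to walk the whole decoded structure and collect every string that contains it. This is only a sketch; find_people_links is a hypothetical helper and the sample data is invented to show the call.

```python
import json


def find_people_links(node):
    """Recursively collect every string containing /people/ in decoded JSON."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_people_links(value))
    elif isinstance(node, list):
        for value in node:
            found.extend(find_people_links(value))
    elif isinstance(node, str) and "/people/" in node:
        found.append(node)
    return found


# Invented sample; the real response layout may differ.
sample = json.loads('{"results": [{"url": "/people/karl-marx"}, {"url": "/news/x"}]}')
print(find_people_links(sample))
```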
