I'm trying to write a web scraper that takes a given URL from Cambridge Books Online (e.g., http://ebooks.cambridge.org/ebook.jsf?bid=CBO9781139171540), downloads the individual chapters, and concatenates them into a single PDF.
Unfortunately, the PDFs don't appear to be stored in any kind of directory-like structure; instead, they seem to be served through some kind of API.
An example URL for a chapter from a book might look something like:
http://ebooks.cambridge.org/pdf_viewer.jsf?cid=CBO9781139171540A003&ref=false&pubCode=CUP&urlPrefix=cambridge&productCode=cbo
It's not readily apparent to me what kind of file I'd get if I just tried to download that URL, though it opens the desired chapter in Chrome's PDF viewer.
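One quick check I could run is sketched below: fetch the URL with urllib, look at the Content-Type header, and test whether the body starts with the `%PDF-` marker that PDF files begin with. The function names `looks_like_pdf` and `describe_response` are just ones I made up for illustration, and I'm assuming the site answers a plain urllib request without a login or cookies.

```python
from urllib.request import urlopen


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF magic number."""
    return data.startswith(b"%PDF-")


def describe_response(url):
    """Fetch url and report its Content-Type and whether the body looks like a PDF.

    Assumption: the server responds to an anonymous GET request.
    """
    with urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        head = response.read(1024)  # the first bytes are enough for the check
    return content_type, looks_like_pdf(head)
```

Calling `describe_response` on the chapter URL above should then tell me whether I'm getting a PDF or an HTML wrapper page.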
From other reading, it seems there are browser-automation tools, like Splinter, for driving web pages from within Python.
The approach that seems most feasible to me at the moment is to open the page from a Python script with something like Splinter, click the links that open the PDFs in pop-up windows, and download the PDFs from those windows, as a human would but automated in Python. Is there a good package for this kind of manipulation?
Alternatively, any other approaches to this problem would be greatly appreciated.
EDIT: I should clarify that the primary challenge I'm having with the conventional BeautifulSoup approach is that the chapter URL doesn't point to the PDF itself, but to an HTML page that loads the PDF via Ajax. For example, the chapter URL I linked, if downloaded, looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--[if lt IE 7]> <html lang="en" class="no-js ie6 oldie"> <![endif]-->
<!--[if IE 7]> <html lang="en" class="no-js ie7 oldie"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 oldie"> <![endif]-->
<!--[if gt IE 8]><!-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class='no-js'>
<!--<![endif]-->
<head>
<title>CBO9781139171540A003</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<style type="text/css">
html, body, div, iframe { margin:0; padding:0; height:100%; }
iframe { display:block; width:100%; border:none; }
</style>
</head>
<body>
<div id="loading_icon" style="margin-top: 350px; text-align: center; min-height:678px;">
<b>Please wait, page is loading...</b><br /><img width="208" height="13" src="images/loadingAnimation.gif" alt="Loading..." />
</div>
<div id="base" class="container" style="display: none;">
<div id="main" role="main">
</div>
</div>
<script src='js/jquery.min.1.7.2.js' type='text/javascript'></script><noscript>Your browser does not support JavaScript!</noscript>
<script type="text/javascript">
$(function(){
$.ajax({
type: "POST",
url: "pdf_info/CBO9781139171540A003",
success: function(data) {
var UA = "Python-urllib/3.5";
var OSindex = UA.indexOf("OS");
var OSver = UA.substr(OSindex, 6 );
var ipad2ver = "OS 3_2";
var isIpad2 = ipad2ver <= OSver;
if(("Python-urllib/3.5".indexOf("iPad") > -1 && isIpad2) ||
("Python-urllib/3.5".indexOf("iPhone") > -1)){
var dimension = data.split(",");
dimension[0] = parseFloat(dimension[0]) + 16;
$("#main").append(
$("<iframe />")
.attr("id", "pdf_frame")
.attr("style", "width:" + dimension[0] + "px; height:" + dimension[1] + "px;")
.attr("src", "open_pdf/CBO9781139171540A003;jsessionid=A3BD36F983D3B975B34BD467BB14AC7E?pubCode=CUP&urlPrefix=cambridge&productCode=cbo&isSearch=false")
);
} else {
$("#main").append(
$("<iframe />")
.attr("id", "pdf_frame")
.attr("src", "open_pdf/CBO9781139171540A003;jsessionid=A3BD36F983D3B975B34BD467BB14AC7E?pubCode=CUP&urlPrefix=cambridge&productCode=cbo&isSearch=false")
);
}
$("#loading_icon").hide();
$("#base").show();
}
});
});
</script>
</body>
</html>
Perhaps I just don't know as much as I should about Ajax, but it's not immediately obvious to me what I should be opening and downloading with urllib and BeautifulSoup.
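My best guess is that the thing to fetch is the `open_pdf/...` address that the inline script assigns to the iframe's `src`. Here is a minimal sketch of pulling that address out of the downloaded page and resolving it against the viewer URL. The name `find_pdf_url` and the regular expression are my own, and I'm only assuming (not sure) that the resulting URL returns the PDF and that the embedded `jsessionid` remains valid for a second request.

```python
import re
from urllib.parse import urljoin

# Matches the src given to the iframe in the viewer page's inline script,
# e.g. .attr("src", "open_pdf/CBO...;jsessionid=...?pubCode=CUP&...")
IFRAME_SRC = re.compile(r'\.attr\("src",\s*"(open_pdf/[^"]+)"\)')


def find_pdf_url(viewer_url, viewer_html):
    """Return the absolute open_pdf URL embedded in the viewer page, or None."""
    match = IFRAME_SRC.search(viewer_html)
    if match is None:
        return None
    # The src is relative, so resolve it against the page it came from.
    return urljoin(viewer_url, match.group(1))
```

Given the viewer URL and the HTML above, this should yield a URL under http://ebooks.cambridge.org/open_pdf/, which I could then try to download with urllib. Both iframe branches in the script use the same `src`, so taking the first match seems sufficient.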
Thanks again!