
I'm trying to write a web scraper that takes the URL of a book on Cambridge Books Online (e.g., http://ebooks.cambridge.org/ebook.jsf?bid=CBO9781139171540), downloads its individual chapters, and concatenates them into a single PDF.

Unfortunately, the PDFs don't appear to be stored in any kind of directory-like structure; instead, they seem to be served through some kind of API.

An example url for a chapter from a book might look something like:

http://ebooks.cambridge.org/pdf_viewer.jsf?cid=CBO9781139171540A003&ref=false&pubCode=CUP&urlPrefix=cambridge&productCode=cbo
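As a rough sketch, the query string of a chapter URL like this can be taken apart with Python's standard `urllib.parse`, which at least isolates the chapter id (`cid`). The helper name `chapter_params` is made up for illustration, and it assumes each parameter appears only once.

```python
from urllib.parse import urlsplit, parse_qs

# The example chapter URL from above, split across lines for readability.
CHAPTER_URL = (
    "http://ebooks.cambridge.org/pdf_viewer.jsf"
    "?cid=CBO9781139171540A003&ref=false&pubCode=CUP"
    "&urlPrefix=cambridge&productCode=cbo"
)


def chapter_params(url):
    """Return the query parameters of a chapter URL as a plain dict.

    Hypothetical helper: assumes every parameter appears exactly once,
    so only the first value of each is kept.
    """
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


params = chapter_params(CHAPTER_URL)
print(params["cid"])
```

This says nothing yet about where the PDF itself lives; it only separates the chapter id from the other parameters.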

It's not obvious to me what kind of file I'd get if I simply downloaded that URL, though it opens the desired chapter in Chrome's PDF viewer.
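One way to find out what actually comes back, sketched under the assumption that the response body is already in hand as bytes: a PDF file starts with the marker `%PDF-`, so the first few bytes distinguish a real PDF from an HTML wrapper page. The function name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(data):
    """Return True if the given bytes begin with the PDF header marker.

    Leading whitespace is skipped first, as a lenient assumption about
    what a server might prepend.
    """
    return data.lstrip().startswith(b"%PDF-")


# A real PDF begins with a version header; an HTML page does not.
print(looks_like_pdf(b"%PDF-1.4\n..."))
print(looks_like_pdf(b"<!DOCTYPE html PUBLIC ..."))
```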

From other reading, it seems there are tools, like Splinter, for driving a web browser from within Python.

The approach that seems most feasible to me at the moment is to open the page from a Python script through something like Splinter, click the links that open the PDFs in pop-up windows, and download the PDFs from those windows, just as a human would but automated via Python. Is there a good package for this kind of manipulation?

Alternatively, any other approaches to this problem would be greatly appreciated.

EDIT: I should clarify that the primary challenge I'm having with the conventional BeautifulSoup approach is that the chapter URL doesn't point to the PDF itself, but to an HTML page that loads the PDF via Ajax. For example, the chapter URL I linked, if downloaded, looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--[if lt IE 7]> <html lang="en" class="no-js ie6 oldie"> <![endif]-->
<!--[if IE 7]>    <html lang="en" class="no-js ie7 oldie"> <![endif]-->
<!--[if IE 8]>    <html lang="en" class="no-js ie8 oldie"> <![endif]-->
<!--[if gt IE 8]><!-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class='no-js'>
<!--<![endif]-->


    <head>
        <title>CBO9781139171540A003</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />





                <style type="text/css"> 
                    html, body, div, iframe { margin:0; padding:0; height:100%; }
                    iframe { display:block; width:100%; border:none; }
                </style>


    </head>

    <body>

        <div id="loading_icon" style="margin-top: 350px; text-align: center; min-height:678px;">
            <b>Please wait, page is loading...</b><br /><img width="208" height="13" src="images/loadingAnimation.gif"  alt="Loading..." />
        </div>



        <div id="base" class="container" style="display: none;">
            <div id="main" role="main"> 


            </div>
        </div>

        <script src='js/jquery.min.1.7.2.js' type='text/javascript'></script><noscript>Your browser does not support JavaScript!</noscript>    
        <script type="text/javascript">
            $(function(){   
                $.ajax({
                    type: "POST",
                    url: "pdf_info/CBO9781139171540A003",
                    success: function(data) {

                        var UA = "Python-urllib/3.5";
                        var OSindex = UA.indexOf("OS");
                        var OSver = UA.substr(OSindex, 6 );     
                        var ipad2ver = "OS 3_2";
                        var isIpad2 = ipad2ver <= OSver;

                        if(("Python-urllib/3.5".indexOf("iPad") > -1 && isIpad2) ||
                            ("Python-urllib/3.5".indexOf("iPhone") > -1)){
                                var dimension = data.split(",");
                                dimension[0] = parseFloat(dimension[0]) + 16;     
                                $("#main").append(
                                    $("<iframe />")
                                        .attr("id", "pdf_frame")
                                        .attr("style", "width:" + dimension[0] + "px; height:" + dimension[1] + "px;")
                                        .attr("src", "open_pdf/CBO9781139171540A003;jsessionid=A3BD36F983D3B975B34BD467BB14AC7E?pubCode=CUP&urlPrefix=cambridge&productCode=cbo&isSearch=false")
                                );
                        } else {
                            $("#main").append(
                                $("<iframe />")
                                    .attr("id", "pdf_frame")
                                    .attr("src", "open_pdf/CBO9781139171540A003;jsessionid=A3BD36F983D3B975B34BD467BB14AC7E?pubCode=CUP&urlPrefix=cambridge&productCode=cbo&isSearch=false")
                                );
                        }

                        $("#loading_icon").hide();
                        $("#base").show();                      
                    }
                });
            }); 

        </script>

    </body>

</html>

Perhaps I just don't know as much as I should about Ajax, but it's not immediately obvious to me what I should be opening or downloading with urllib and BeautifulSoup.
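One possible reading of the page above, offered as a sketch rather than a confirmed solution: the inline script only builds an `<iframe>` whose `src` is a relative `open_pdf/...` URL, so that URL can be pulled out of the downloaded HTML with a regular expression and resolved against the viewer URL. Whether the server then returns the PDF to a non-browser client, and whether the embedded `jsessionid` must match the session that fetched the viewer page, are unverified assumptions. The names `find_pdf_url` and `SRC_PATTERN` are made up for illustration.

```python
import re
from urllib.parse import urljoin

VIEWER_URL = (
    "http://ebooks.cambridge.org/pdf_viewer.jsf"
    "?cid=CBO9781139171540A003&ref=false&pubCode=CUP"
    "&urlPrefix=cambridge&productCode=cbo"
)

# One line of the inline script from the downloaded page, used as sample input.
SAMPLE_HTML = (
    '.attr("src", "open_pdf/CBO9781139171540A003;'
    'jsessionid=A3BD36F983D3B975B34BD467BB14AC7E'
    '?pubCode=CUP&urlPrefix=cambridge&productCode=cbo&isSearch=false")'
)

# Matches the iframe src that the script sets: .attr("src", "open_pdf/...")
SRC_PATTERN = re.compile(r'\.attr\(\s*"src"\s*,\s*"(open_pdf/[^"]+)"')


def find_pdf_url(viewer_html, viewer_url):
    """Return the absolute open_pdf URL found in the viewer HTML, or None.

    Hypothetical helper: it relies on the script layout shown above and
    takes the first match (both branches of the script use the same src).
    """
    match = SRC_PATTERN.search(viewer_html)
    if match is None:
        return None
    # The src is relative, so resolve it against the viewer page's URL.
    return urljoin(viewer_url, match.group(1))


pdf_url = find_pdf_url(SAMPLE_HTML, VIEWER_URL)
print(pdf_url)
```

The resulting URL could then be requested with `urllib.request`, probably through the same cookie-aware opener that fetched the viewer page, and the first bytes of the response checked for the `%PDF-` marker.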

Thanks again!

Dragonsheep
  • If it's a pdf file, you should be able to download it and then "run" that with `os.system()`. The application associated with pdf files will automatically be invoked. – martineau Aug 30 '16 at 04:21
  • I think a combination of `beautifulsoup` and `wget` would be sufficient. However, getting the PDF URL will be tricky since the page is built with JavaScript. Once you have the URL, you can do `wget.download(url)`. – iparjono Aug 30 '16 at 04:32
  • And please don't assume that a URL must have the form `... dir/file.pdf` to be downloadable. – iparjono Aug 30 '16 at 04:39
  • Thanks for your replies! Getting the PDF URL does indeed seem to be the challenge, since downloading the http://ebooks.cambridge.org/pdf_viewer... page gave me an HTML document rather than a PDF file. I've edited my original question to include the HTML that retrieves the PDF, but I'm not sure how Python should interact with the Ajax to download the PDF. – Dragonsheep Aug 30 '16 at 06:25
  • Possibly relevant http://stackoverflow.com/questions/21069294/parse-the-javascript-returned-from-beautifulsoup – Eva Aug 30 '16 at 07:00
