
I am trying to write a script to automatically query sci-hub.io with an article's title and save a PDF copy of the article's full text to my computer with a specific file name.

To do this I have written the following code:

import requests
from pandas import read_csv

url = "http://sci-hub.io/"
data = read_csv("C:\\Users\\Sangeeta's\\Downloads\\distillersr_export (1).csv")
for index, row in data.iterrows():
    try:
        print('http://sci-hub.io/' + str(row['DOI']))
        res = requests.get('http://sci-hub.io/' + str(row['DOI']))
        print(res.content)
    except Exception:
        print('NO DOI: ' + str(row['ref']))

This opens a CSV file with a list of DOIs and the names of the files to be saved. For each DOI, it then queries sci-hub.io for the full text. The returned page embeds the PDF in an iframe; however, I am now unsure how to extract the URL for the PDF and save it to disk.

An example of the page can be seen in the image below:

[screenshot of the sci-hub.io result page with the embedded PDF]

In this image, the desired URL is http://dacemirror.sci-hub.io/journal-article/3a257a9ec768d1c3d80c066186aba421/pajno2010.pdf.

How can I automatically extract this URL and then save the PDF file to disk?

When I print res.content, I get this:

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title></title>\n        <meta charset="UTF-8">\n        <meta name="viewport" content="width=device-width">\n    </head>\n    <body>\n    <style type = "text/css">\n        body {background-color:#F0F0F0}\n        div {overflow: hidden; position: absolute;}\n        #top {top:0;left:0;width:100%;height:50px;font-size:14px} /* 40px */\n        #content {top:50px;left:0;bottom:0;width:100%}\n        p {margin:0;padding:10px}\n        a {font-size:12px;font-family:sans-serif}\n        a.target {font-weight:normal;color:green;margin-left:10px}\n        a.reopen {font-weight:normal;color:blue;text-decoration:none;margin-left:10px}\n        iframe {width:100%;height:100%}\n        \n        p.agitation {padding-top:5px;font-size:20px;text-align:center}\n        p.agitation a {font-size:20px;text-decoration:none;color:green}\n\n        .banner {position:absolute;z-index:9999;top:400px;left:0px;width:300px;height:225px;\n                 border: solid 1px #ccc; padding: 5px;\n                 text-align:center;font-size:18px}\n        .banner img {border:0}\n        \n        p.donate {padding:0;margin:0;padding-top:5px;text-align:center;background:green;height:40px}\n        p.donate a {color:white;font-weight:bold;text-decoration:none;font-size:20px}\n\n        #save {position:absolute;z-index:9999;top:180px;left:8px;width:210px;height:36px;\n                 border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n                 text-align:center;font-size:18px;background:#F0F0F0;color:#333}\n\n        #save a {text-decoration:none;color:white;font-size:inherit;color:#666}\n\n        #save p { margin: 0; padding: 0; margin-top: 8px}\n\n        #reload {position:absolute;z-index:9999;top:240px;left:8px;width:210px;height:36px;\n                 border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n                 text-align:center;font-size:18px;background:#F0F0F0;color:#333}\n\n        #reload a 
{text-decoration:none;color:white;font-size:inherit;color:#666}\n\n        #reload p { margin: 0; padding: 0; margin-top: 8px}\n\n\n        #saveastro {position:absolute;z-index:9999;top:360px;left:8px;width:230px;height:70px;\n                    border-radius: 4px; border: solid 1px #ccc; background: white; text-align:center}\n        #saveastro p { margin: 0; padding: 0; margin-top: 16px}\n        \n        \n        #donate {position:absolute;z-index:9999;top:170px;right:16px;width:220px;height:36px;\n                 border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n                 text-align:center;font-size:18px;background:white;color:#333}\n        \n        #donate a {text-decoration:none;color:green;font-size:inherit}\n\n        #donatein {position:absolute;z-index:9999;top:220px;right:16px;width:220px;height:36px;\n                 border-radius: 4px; border: solid 1px #ccc; padding: 5px;\n                 text-align:center;font-size:18px;background:green;color:#333}\n\n        #donatein a {text-decoration:none;color:white;font-size:inherit}\n        \n        #banner {position:absolute;z-index:9999;top:50%;left:45px;width:250px;height:250px; padding: 0; border: solid 1px white; border-radius: 4px}\n        \n    </style>\n    \n        \n    \n    <script type = "text/javascript">\n        window.onload = function() {\n            var url = document.getElementById(\'url\');\n            if (url.innerHTML.length > 77)\n                url.innerHTML = url.innerHTML.substring(0,77) + \'...\';\n        };\n    </script>\n    <div id = "top">\n        \n        <p class="agitation" style = "padding-top:12px">\n            \xd0\xa1\xd1\x82\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x87\xd0\xba\xd0\xb0 \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x82\xd0\xb0 Sci-Hub \xd0\xb2 \xd1\x81\xd0\xbe\xd1\x86\xd0\xb8\xd0\xb0\xd0\xbb\xd1\x8c\xd0\xbd\xd1\x8b\xd1\x85 \xd1\x81\xd0\xb5\xd1\x82\xd1\x8f\xd1\x85 \xe2\x86\x92  <a target="_blank" 
href="https://vk.com/sci_hub">vk.com/sci_hub</a>\n        </p>\n        \n    </div>\n    \n    <div id = "content">\n        <iframe src = "http://moscow.sci-hub.io/202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf" id = "pdf"></iframe>\n    </div>\n    \n    <div id = "donate">\n        <p><a target = "_blank" href = "//sci-hub.io/donate">\xd0\xbf\xd0\xbe\xd0\xb4\xd0\xb4\xd0\xb5\xd1\x80\xd0\xb6\xd0\xb0\xd1\x82\xd1\x8c \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x82 &rarr;</a></p>\n    </div>\n    <div id = "donatein">\n        <p><a target = "_blank" href = "//sci-hub.io/donate">support the project &rarr;</a></p>\n    </div>\n    <div id = "save">\n        <p><a href = # onclick = "location.href=\'http://moscow.sci-hub.io/202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf?download=true\'">\xe2\x87\xa3 \xd1\x81\xd0\xbe\xd1\x85\xd1\x80\xd0\xb0\xd0\xbd\xd0\xb8\xd1\x82\xd1\x8c \xd1\x81\xd1\x82\xd0\xb0\xd1\x82\xd1\x8c\xd1\x8e</a></p>\n    </div>\n    <div id = "reload">\n        <p><a href = "//sci-hub.io/reload/10.1016/j.anai.2016.01.022" target = "_blank">&#8635; \xd1\x81\xd0\xba\xd0\xb0\xd1\x87\xd0\xb0\xd1\x82\xd1\x8c \xd0\xb7\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbe</a></p>\n    </div>\n    \n        \n<!-- Yandex.Metrika counter --> <script type="text/javascript"> (function (d, w, c) { (w[c] = w[c] || []).push(function() { try { w.yaCounter10183018 = new Ya.Metrika({ id:10183018, clickmap:true, trackLinks:true, accurateTrackBounce:true, ut:"noindex" }); } catch(e) { } }); var n = d.getElementsByTagName("script")[0], s = d.createElement("script"), f = function () { n.parentNode.insertBefore(s, n); }; s.type = "text/javascript"; s.async = true; s.src = "https://mc.yandex.ru/metrika/watch.js"; if (w.opera == "[object Opera]") { d.addEventListener("DOMContentLoaded", f, false); } else { f(); } })(document, window, "yandex_metrika_callbacks"); </script> <noscript><div><img src="https://mc.yandex.ru/watch/10183018?ut=noindex" style="position:absolute; left:-9999px;" 
alt="" /></div></noscript> <!-- /Yandex.Metrika counter -->\n    </body>\n</html>\n'

This does include the URL; however, I am unsure how to extract it.

Update:

I am now able to extract the URL, but when I try to access the page with the PDF (through urllib.request) I get a 403 response even though the URL is valid. Any ideas on why this happens and how to fix it? (I am able to access it through my browser, so I am not IP-blocked.)
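For what it's worth, a 403 that only appears outside the browser is often caused by the server rejecting urllib's default `Python-urllib` User-Agent. A minimal sketch of a workaround (the header value is an arbitrary browser-like string, and the example URL/filename in the comment are the ones from the screenshot):

```python
import urllib.request

def fetch_pdf(pdf_url, out_path):
    """Download a PDF, sending a browser-like User-Agent.

    urllib's default "Python-urllib/3.x" User-Agent is rejected by some
    servers even when a browser can fetch the same URL, which shows up
    as a 403.
    """
    req = urllib.request.Request(
        pdf_url,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())

# e.g. fetch_pdf("http://dacemirror.sci-hub.io/journal-article/"
#                "3a257a9ec768d1c3d80c066186aba421/pajno2010.pdf",
#                "pajno2010.pdf")
```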

  • Are you able to see this PDF URL in the res.content object? If yes, then you can use a regular expression to extract that URL and use the urllib module (urllib.urlretrieve(url, filename)) to save the PDF to disk. filename is the place where you want to save the PDF. – SB07 Sep 17 '17 at 15:40
  • @SB07 Have edited to include res.content. How can I get the URL from it? –  Sep 17 '17 at 15:46
  • @apostrophe use an HTML parser such as beautifulsoup4? – Jon Clements Sep 17 '17 at 15:48
  • First off, please consider the load you're putting on sci-hub's servers. Be nice and add some throttling at least. – Konstantin Schubert Sep 17 '17 at 15:51
  • @KonstantinSchubert I will - the file currently only contains 1 DOI. –  Sep 17 '17 at 15:53
  • You can use a regular expression to extract the url. Have a look here https://stackoverflow.com/questions/4666973/how-to-extract-a-substring-from-inside-a-string-in-python and here https://pythex.org/ – Konstantin Schubert Sep 17 '17 at 15:53
  • Do you know the title of an article available on that site? – Bill Bell Sep 17 '17 at 15:56
  • Or the DOI of one of them? – Bill Bell Sep 17 '17 at 15:56
  • @BillBell I have a list of titles & DOIs. I plan to first search using DOIs and, if nothing comes up, titles. –  Sep 17 '17 at 15:58
  • @BillBell Sample DOI: 10.1016/j.anai.2016.01.022 –  Sep 17 '17 at 15:59

4 Answers


You can use the urllib library to access the HTML of the page and even download files, and a regular expression to find the URL of the file you want to download.

import urllib  # Python 2; for Python 3 use urllib.request (see the comments below)
import re

site = urllib.urlopen(".../index.html")
data = site.read() # turns the contents of the site into a string
# match whole .pdf URLs (no capture groups, so findall returns full matches,
# and the dot in ".pdf" is escaped)
files = re.findall(r'https?://[\w.,@?^=%&:/~+#-]+\.pdf', data)

for file in files:
    urllib.urlretrieve(file, filepath) # "filepath" is where you want to save it
Roxerg
  • When I run this I get this error: `AttributeError: module 'urllib' has no attribute 'urlretrieve'` –  Sep 17 '17 at 16:05
  • @apostrophe are you using python 3? In that case, it's `import urllib.request` followed by `urllib.request.urlretrieve(file, filepath)` – Roxerg Sep 17 '17 at 16:09

Here is the solution:

url = re.search(r'<iframe src = "\s*([^"]+)"', res.text)  # res.content is bytes; match against res.text
urllib.urlretrieve(url.group(1), 'C:/.../Docs/test.pdf')

I ran it and it is working :)

For Python 3:

Change urllib.urlretrieve to urllib.request.urlretrieve.
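Put together for Python 3, the extraction might look like this (the iframe line is a stand-in copied from the res.content dump in the question; the commented-out save path is illustrative):

```python
import re
import urllib.request  # needed for the commented-out urlretrieve call

# stand-in for res.text from the question (requests' decoded page body)
html = ('<iframe src = "http://moscow.sci-hub.io/'
        '202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf" id = "pdf"></iframe>')

match = re.search(r'<iframe src = "\s*([^"]+)"', html)
pdf_url = match.group(1)
print(pdf_url)
# → http://moscow.sci-hub.io/202d9ebdfbb8c0c56964a31b2fdfe8e9/roerdink2016.pdf
# urllib.request.urlretrieve(pdf_url, 'test.pdf')  # saves the PDF to disk
```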

SB07

You can do it with a clunky code requiring selenium, requests and scrapy.

Use selenium to request either an article title or DOI.

>>> from selenium import webdriver
>>> driver = webdriver.Firefox()  # any browser driver will do
>>> driver.get("http://sci-hub.io/")
>>> input_box = driver.find_element_by_name('request')
>>> input_box.send_keys('amazing scientific results\n')

An article by the title 'amazing scientific results' doesn't seem to exist. As a result, the site returns a diagnostic page in the browser window which we can ignore. It also puts 'http://sci-hub.io/' in webdriver's current_url property. This is helpful because it's an indication that the requested result isn't available.

>>> driver.current_url
'http://sci-hub.io/'

Let's try again, looking for the item that you know exists.

>>> driver.get("http://sci-hub.io/")
>>> input_box = driver.find_element_by_name('request')
>>> input_box.send_keys('DOI: 10.1016/j.anai.2016.01.022\n')
>>> driver.current_url
'http://sci-hub.io/10.1016/j.anai.2016.01.022'

This time the site returns a distinctive url. Unfortunately, if we load this using selenium we will get the pdf and, unless you're more able than I am, you will find it difficult to download this to a file on your machine.

Instead, I download it using the requests library. Loaded in this form you will find that the url of the pdf becomes visible in the HTML.

>>> import requests
>>> r = requests.get(driver.current_url)

To ferret out the url I use scrapy.

>>> from scrapy.selector import Selector
>>> selector = Selector(text=r.text)
>>> pdf_url = selector.xpath('.//iframe/@src')[0].extract()

Finally I use requests again to download the pdf so that I can save it to a conveniently named file on local storage.

>>> r = requests.get(pdf_url).content
>>> open('article_name', 'wb').write(r)
211853
Bill Bell

I solved this using a combination of the answers above - namely SB07's & Roxerg's.

I use the following to extract the URL from the page and then download the PDF:

    # runs inside the loop from the question; requires requests, re, and
    # BeautifulSoup (bs4) with the html5lib parser installed
    res = requests.get('http://sci-hub.io/' + str(row['DOI']))
    useful = BeautifulSoup(res.content, "html5lib").find_all("iframe")
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(useful[0]))
    response = requests.get(urls[0])
    with open("C:\\Users\\Sangeeta's\\Downloads\\ref\\" + str(row['ref']) + '.pdf', 'wb') as fw:
        fw.write(response.content)
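As Konstantin Schubert's comment suggests, it's worth throttling this loop so it doesn't hammer the server. A small sketch of one way to do that (the helper name and the 5-second default are my own choices):

```python
import time

def throttled(iterable, delay_seconds=5.0):
    """Yield items from iterable, sleeping between consecutive items
    so the enclosing download loop doesn't hammer the server."""
    for i, item in enumerate(iterable):
        if i:  # no need to sleep before the first request
            time.sleep(delay_seconds)
        yield item

# usage: for index, row in throttled(data.iterrows()): ...
```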

Note: This will not work for all articles - some DOIs resolve to web pages (example) rather than PDFs, and the code above does not handle those correctly.
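One way to guard against those cases is to check the response's Content-Type header before writing the file, so HTML landing pages don't get saved under a .pdf name. A minimal sketch (the helper name is mine, not from any library):

```python
def looks_like_pdf(content_type):
    """True when a Content-Type header value indicates a PDF."""
    return "pdf" in (content_type or "").lower()

# inside the download loop above:
# response = requests.get(urls[0])
# if looks_like_pdf(response.headers.get("Content-Type")):
#     ... save response.content to the .pdf file ...
# else:
#     print('NOT A PDF: ' + str(row['ref']))
```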