
I have been struggling with this for a whole day. I compile a list of URLs by scraping a webpage with urllib3 and BeautifulSoup. The URLs point to pdf files whose download I want to automate. Helpfully, the pdf links follow a clean pattern, so I can easily use a regex to build a list of the pdfs I want and ignore the rest. But that is where the problem starts. The URLs follow the pattern http://www.ti.com/lit/sboa263

NOTE: The last part changes for other files, and there is NO .pdf file extension.

But if you put this link in your browser, you can clearly see it change from that to http://www.ti.com/general/docs/lit/getliterature.tsp?baseLiteratureNumber=sboa263, which eventually changes to http://www.ti.com/lit/an/sboa263/sboa263.pdf
Now I understand you could tell me, "Great, follow this pattern then." But I don't want to, because IMO that is not the right way to solve this automation. I want to be able to download the pdf from the first link itself.

I have tried

response = requests.get(url, allow_redirects=True)

which only takes me to the result of the first redirect, NOT the final file. Even response.history takes me only to the first redirect.
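
For reference, here is a sketch of how the chain can be inspected (using the sboa263 link from above as an example) -

import requests

# requests follows HTTP redirects (301/302) by default; history lists the
# hops, and response.url is where the HTTP-level chain ends.
response = requests.get('http://www.ti.com/lit/sboa263', allow_redirects=True)
for resp in response.history:
    print(resp.status_code, resp.url)
print(response.url)   # ends at the getliterature.tsp page, not the pdf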

And when I try to download the file from there anyway, I get a corrupted pdf that won't open. However, when I manually pass the final URL just to test the correctness of my file write, I get the actual pdf in perfect order. I don't understand why requests is not able to get to the final URL. My full code is below for your reference -

from bs4 import BeautifulSoup
import urllib3
import re

http = urllib3.PoolManager()
url = 'http://www.ti.com/analog-circuit/circuit-cookbook.html'
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')

# Keep only links that start with the TI literature prefix
find = re.compile(r"http://www\.ti\.com/lit/")
download_links = []
for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
    if find.match(link['href']):
        download_links.append(link['href'])

To get to the redirected URL I use -

import requests

response = requests.get(download_links[45])
if response.history:
    print("Request was redirected")
    for resp in response.history:
        final_url = resp.url   # ends up as the last hop recorded in history
response = requests.get(final_url)

For downloading the file I used the code below -

with open('C:/Users/Dell/Desktop/test.pdf', 'wb') as f:
    f.write(response.content)

Also, I would actually just like to pass a folder name and have all files download with the name of the last part of the URL itself. I am yet to figure out how to do that. I tried shutil but it didn't work. If you can help me with that part too, it would be great; the sketch below is roughly what I have in mind.
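
Something like naming each file after the last path segment of its URL (a rough sketch with a made-up folder, assuming final_url is one of the resolved pdf links and response holds its content) -

import os
from urllib.parse import urlparse

folder = 'C:/Users/Dell/Desktop/pdfs'   # made-up destination folder
# 'http://www.ti.com/lit/an/sboa263/sboa263.pdf' -> 'sboa263.pdf'
filename = os.path.basename(urlparse(final_url).path)
with open(os.path.join(folder, filename), 'wb') as f:
    f.write(response.content)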

EDIT: I passed the first two URLs to Postman and got HTML, whereas passing the third URL downloads the pdf. In the HTML I get, I can clearly see that one of the meta properties lists the final pdf URL. Here is the relevant part of the Postman result -

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta content="IE=8;IE=9;IE=edge" http-equiv="x-ua-compatible">
        <meta content='width=device-width, initial-scale=1.0' name='viewport'>
        <META HTTP-EQUIV="Pragma" CONTENT="no-cache">
        <META HTTP-EQUIV="Expires" CONTENT="-1">
        <META HTTP-EQUIV="Refresh" CONTENT="1; URL=http://www.ti.com/lit/an/sboa263/sboa263.pdf">

The parts below also show the final URL, but I think you get the idea. Can we possibly make use of this information?

• I think during the first redirect Python knew what to do, because everything was there in the header of the 301 response, i.e. the new location. But if you look at the response of this second URL, it is actually HTML containing the new URL inside the meta tag. So your user agent knows that it has to refresh the page with that new URL, but Python will come to know nothing about it from there. You have to handle redirects from meta tags yourself. Maybe [this link](https://stackoverflow.com/questions/2318446/how-to-follow-meta-refreshes-in-python) will be useful. – Miraj50 May 06 '18 at 08:44

1 Answer


As mentioned by Miraj50, it was indeed a meta refresh that took it to the final URL. So I extracted the final URLs from the meta tags and was able to download all 45 pdfs. Below is the code for the same -

for link in download_links[5:]:
    response = requests.get(link)
    if response.history:
        print("Request was redirected")
        print(response.url)
        r = response.url   # the intermediate page that serves the meta refresh

Getting links from the Meta Tag -

# The redirected URL serves an HTML page that uses a meta refresh
meta_response = http.request('GET', r)
meta_soup = BeautifulSoup(meta_response.data, 'html.parser')
meta_result = meta_soup.find('meta', attrs={'http-equiv': 'Refresh'})
# content looks like "1; URL=http://www.ti.com/lit/an/sboa263/sboa263.pdf"
wait, text = meta_result["content"].split(";")
final_url = text.strip()[4:]   # drop the leading "URL="
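
Folding the two snippets together with the actual download, naming each file after the last segment of its URL (a sketch; the destination folder is just an example) -

import os
from urllib.parse import urlparse

folder = 'C:/Users/Dell/Desktop/pdfs'   # example destination
for link in download_links[5:]:
    r = requests.get(link).url                     # resolve the HTTP redirect
    meta_soup = BeautifulSoup(http.request('GET', r).data, 'html.parser')
    meta_result = meta_soup.find('meta', attrs={'http-equiv': 'Refresh'})
    wait, text = meta_result["content"].split(";")
    final_url = text.strip()[4:]                   # strip the leading "URL="
    pdf = requests.get(final_url)
    name = os.path.basename(urlparse(final_url).path)  # e.g. 'sboa263.pdf'
    with open(os.path.join(folder, name), 'wb') as f:
        f.write(pdf.content)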