I have been struggling with this for a whole day. I compile a list of URLs that I get by using urllib3 on a webpage (also using BeautifulSoup). The URLs basically point to PDF files whose download I wanted to automate. What's great is that the PDF links follow a beautiful pattern, so I can easily use a regex to build a list of the PDFs I want to download and ignore the rest. But that's where the problem starts. The URLs follow the pattern http://www.ti.com/lit/sboa263
NOTE: The last part changes for other files, and there is NO .pdf file extension.
But if you put this link in your browser, you can clearly see it change to http://www.ti.com/general/docs/lit/getliterature.tsp?baseLiteratureNumber=sboa263, which eventually changes to http://www.ti.com/lit/an/sboa263/sboa263.pdf
Now I understand you could tell me, "Great, follow this pattern then." But I don't want to, because IMO this is not the right way to solve this automation. I want to be able to download the PDF from the first link itself.
I have tried

response = requests.get(url, allow_redirects=True)

which only takes me to the result of the first redirect, NOT the final file. Even response.history only takes me as far as the first redirect.
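For reference, here is how I checked that the chain stops after the first hop; url below stands for one of the scraped links:

import requests

response = requests.get(url, allow_redirects=True)
# every entry in response.history is one redirect that requests followed
for resp in response.history:
    print(resp.status_code, resp.url)
print('Landed on:', response.status_code, response.url)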
And when I try to download the file from there anyway, I get a corrupted PDF that won't open. However, when I manually pass the final URL just to test the correctness of my file write, I get the actual PDF in perfect order. I don't understand why requests is not able to get to the final URL. My full code is below for your reference -
from bs4 import BeautifulSoup
import urllib3
import re

http = urllib3.PoolManager()
url = 'http://www.ti.com/analog-circuit/circuit-cookbook.html'
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')  # name the parser explicitly to avoid the parser warning

find = re.compile("http://www.ti.com/lit/")
download_links = []
for link in soup.find_all('a'):
    href = link.get('href')  # guard against anchors without an href
    if href and re.match(find, href):
        # print(href)
        download_links.append(href)
To get to the redirected URL I use -
import requests

response = requests.get(download_links[45])
if response.history:
    print("Request was redirected")
    for resp in response.history:
        final_url = resp.url
    response = requests.get(final_url)
For downloading the file I used the code below -
with open('C:/Users/Dell/Desktop/test.pdf', 'wb') as f:
    f.write(response.content)
Also, I would actually just like to pass a folder name and have every file downloaded under the name of the last part of its URL. I am yet to figure out how to do that. I tried shutil, but it didn't work. If you can help me with that part too, it would be great.
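What I have in mind is something along these lines (a rough sketch on my end: the download_pdf helper and the .pdf fallback are my own guesses, and the filename is just the last path segment of the URL):

import os
import requests
from urllib.parse import urlsplit

def download_pdf(url, folder):
    # use the last path segment as the filename,
    # e.g. http://www.ti.com/lit/an/sboa263/sboa263.pdf -> sboa263.pdf
    filename = os.path.basename(urlsplit(url).path)
    if not filename.endswith('.pdf'):
        filename += '.pdf'  # the short /lit/ links carry no extension
    response = requests.get(url, allow_redirects=True)
    with open(os.path.join(folder, filename), 'wb') as f:
        f.write(response.content)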
EDIT: I passed the first two URLs to Postman and got HTML, whereas passing the third URL downloads the PDF. In the HTML that I get, I can clearly see that one of the meta properties lists the final PDF URL. Here is the relevant part of the Postman result -
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta content="IE=8;IE=9;IE=edge" http-equiv="x-ua-compatible">
<meta content='width=device-width, initial-scale=1.0' name='viewport'>
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<META HTTP-EQUIV="Refresh" CONTENT="1; URL=http://www.ti.com/lit/an/sboa263/sboa263.pdf">
The parts below also show the final URL, but I think you get the idea. Can we possibly make use of this information?
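For example, I imagine something along these lines could pull the target out of that Refresh tag (an untested sketch; it assumes the content attribute always has the 1; URL=... shape shown above):

import re
import requests
from bs4 import BeautifulSoup

def resolve_pdf_url(url):
    # follow the normal HTTP redirects first, then look for the
    # meta refresh tag in the HTML that comes back
    response = requests.get(url, allow_redirects=True)
    soup = BeautifulSoup(response.text, 'html.parser')
    meta = soup.find('meta', attrs={'http-equiv': re.compile('refresh', re.I)})
    if meta and meta.get('content'):
        # content looks like: 1; URL=http://www.ti.com/lit/an/sboa263/sboa263.pdf
        match = re.search(r'URL=(\S+)', meta['content'], re.I)
        if match:
            return match.group(1)
    return response.url  # fall back to wherever the HTTP redirects ended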