I am new to Python, and at work I am trying to export some historical data. I want to save hundreds of URLs as individual PDFs so we don't have to open and save each one by hand. The URLs are direct links to forms that I would like to download, and the site requires username/password authentication. I can't get Python to export any of the URLs in any format. At first it seemed the site was refusing access because of the username/password, but after I added the requests.get call with auth, the script appears to run and still no export is created.
One of the commenters suggested pywebcopy, so I tried it. It successfully creates a folder and an HTML file in the destination with the correct file name from the URL, but the file itself is blank. I added the authentication piece, but it made no difference: the saved HTML file is still blank.
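import os
import requests

# Credentials are sent only with this first request; its response is discarded.
requests.get('main website url', auth=('username', 'password'))

urls = ['url1', 'url2', 'url3']  # etc.
output_dir = 'folder on my drive'

for url in urls:
    # No auth argument is passed to these requests.
    response = requests.get(url)
    # Anything other than 200 is skipped silently, so no file is written.
    if response.status_code == 200:
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, 'wb') as f:
            f.write(response.content)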
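One thing that may explain the missing files: the requests.get(url) call inside the loop carries no credentials, and any non-200 response is skipped without a message. Below is a minimal sketch of a variant that reuses one requests.Session with its auth attribute set, so the same username and password go out with every request. It assumes the site accepts HTTP Basic credentials on each form URL; if the login is form- or cookie-based, this alone will not be enough. The helper names (make_session, filename_for, download_all), the 30-second timeout and the index.html fallback are illustrative choices, not anything from the original script.

```python
import os
from urllib.parse import urlparse


def make_session(username, password):
    """Return a requests.Session that sends the credentials with every request."""
    # Third-party import kept local so the helpers below work without it.
    import requests

    session = requests.Session()
    session.auth = (username, password)
    return session


def filename_for(url, default="index.html"):
    """Derive a local file name from the URL path, ignoring any query string."""
    name = os.path.basename(urlparse(url).path)
    return name or default  # hypothetical fallback for URLs ending in "/"


def download_all(session, urls, output_dir):
    """Save each response body to output_dir; return {url: status} for failures."""
    os.makedirs(output_dir, exist_ok=True)
    failures = {}
    for url in urls:
        response = session.get(url, timeout=30)
        if response.status_code != 200:
            # Record the failure instead of skipping it silently.
            failures[url] = response.status_code
            continue
        with open(os.path.join(output_dir, filename_for(url)), "wb") as f:
            f.write(response.content)
    return failures
```

It could be called as download_all(make_session('username', 'password'), urls, output_dir), then printing the returned dictionary to see which URLs still come back as 401 or another status. Note that this writes whatever the server returns; if the forms are served as HTML, the saved files will be HTML and converting them to PDF would be a separate step.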
This is the HTTP response:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): apps.bell.com:443
send: b'GET / HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 302 \r\n'
header: Set-Cookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD; Path=/; Secure; HttpOnly
header: Location: https://apps.bell.com/apps/bell/bell.bellmain
header: Content-Type: text/html;charset=ISO-8859-1
header: Content-Length: 0
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET / HTTP/1.1" 302 0
send: b'GET /apps/bell/bell.bellmain HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nCookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD\r\n\r\n'
reply: 'HTTP/1.1 401 \r\n'
header: WWW-Authenticate: Basic realm="bellProduction System - V8MU.Q3"
header: Content-Type: text/html
header: Content-Length: 522
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET /apps/bell/bell.bellmain HTTP/1.1" 401 522
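Reading the log above: neither GET request shows an Authorization header, and the server answers the second one with 401 and a WWW-Authenticate: Basic challenge. The sketch below shows how that challenge can be read and what header value HTTP Basic authentication expects in return. It uses only the standard library; parse_challenge and basic_auth_header are hypothetical helper names, and the credentials in the usage note are placeholders.

```python
import base64
import re


def parse_challenge(status_code, headers):
    """Return (scheme, realm) for a 401 response, or None for any other status.

    `headers` is a plain dict here, so the key lookup is case-sensitive.
    """
    if status_code != 401:
        return None
    challenge = headers.get("WWW-Authenticate", "")
    scheme, _, params = challenge.partition(" ")
    match = re.search(r'realm="([^"]*)"', params)
    return scheme, (match.group(1) if match else None)


def basic_auth_header(username, password):
    """Build the Authorization value used by HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

With the header from the log, parse_challenge(401, {"WWW-Authenticate": 'Basic realm="bellProduction System - V8MU.Q3"'}) gives the scheme "Basic" and the realm string. If the request debug output for a form URL does not contain an Authorization line like the one basic_auth_header('username', 'password') produces, the credentials are not reaching the server for that request.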