I am new to Python, and at work I am trying to export some historical data. I want to save hundreds of URLs as individual PDFs so we don't have to open and save each one by hand. The URLs are direct links to forms that I would like to download. The site also requires username/password authentication. I can't seem to get Python to export the URL link in any format. At first it seemed the site was denying me access because of the username/password, but after I added the requests.get call with the auth argument, the script seems to run but no export is created.

One of the commenters suggested pywebcopy, so I tried it. The tool successfully creates a folder and an HTML file in the destination with the correct URL file name, but the file itself is blank. I added the authentication piece, but it made no difference: the saved HTML file is still blank.


import os
import requests

requests.get('main website url', auth=('username','password'))

urls = ['url1', 'url2', 'url3']  # etc.

output_dir = 'folder on my drive'

for url in urls:
    response = requests.get(url)
    if response.status_code == 200:
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, 'wb') as f:
            f.write(response.content)

This is the HTTP response:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): apps.bell.com:443
send: b'GET / HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 302 \r\n'
header: Set-Cookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD; Path=/; Secure; HttpOnly
header: Location: https://apps.bell.com/apps/bell/bell.bellmain
header: Content-Type: text/html;charset=ISO-8859-1
header: Content-Length: 0
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET / HTTP/1.1" 302 0
send: b'GET /apps/bell/bell.bellmain HTTP/1.1\r\nHost: apps.bell.com\r\nUser-Agent: python-requests/2.27.1\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nCookie: JSESSIONID=B521B7D0D66595F87A12E174CA2C4CAD\r\n\r\n'
reply: 'HTTP/1.1 401 \r\n'
header: WWW-Authenticate: Basic realm="bellProduction System - V8MU.Q3"
header: Content-Type: text/html
header: Content-Length: 522
header: Date: Mon, 16 May 2022 00:26:34 GMT
header: Keep-Alive: timeout=1
header: Connection: keep-alive
DEBUG:urllib3.connectionpool:https://apps.bell.com:443 "GET /apps/bell/bell.bellmain HTTP/1.1" 401 522
dtx780
  • Are you sure you're authenticated when issuing the get requests in the loop? – EDG956 May 15 '22 at 10:48
  • This is too little information. Can you share the http response that you get? If you don't know how, check out https://stackoverflow.com/questions/10588644/how-can-i-see-the-entire-http-request-thats-being-sent-by-my-python-application?answertab=scoredesc#tab-top – MennoK May 15 '22 at 11:06
  • I've attached the HTTP response. When I run my script without the authentication piece, I get an HTTP 401 error. – dtx780 May 16 '22 at 00:35
  • Are you trying to extract the actual html source? or are you trying to capture what the webpage looks like? – Alexander May 16 '22 at 00:36
  • Hi alexpdev, I'm trying to save a copy of the webpage and its contents in a readable format (with all its data), whether that be an HTML file, a PDF, etc. – dtx780 May 16 '22 at 00:37
  • hi @dtx780 are you looking for something like this? [pywebcopy](https://github.com/rajatomar788/pywebcopy) – ahmedshahriar May 16 '22 at 00:42
  • What do you mean by " I cant seem to get python to export the url link in any format" and "the script seems to run but no export is created"? Please give more details. – Code-Apprentice May 16 '22 at 00:48
  • And what do you mean "this is the http response"? Where does the following block come from? – Code-Apprentice May 16 '22 at 00:49
  • ahmedshahriar - Yes, this would work. When I try to use pywebcopy, it creates a blank HTML file in my output directory. – dtx780 May 16 '22 at 01:05
  • Code-Apprentice - The HTTP response is the output from the Stack Overflow thread suggested by MennoK. – dtx780 May 16 '22 at 01:07

1 Answer


The auth keyword argument to requests.get expects an authentication object. For convenience, if you pass a tuple, requests treats it as a request for HTTP Basic Authentication. This authentication mechanism is not stateful, so you have to pass the auth parameter to every get call.
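To see this without touching the network, here is a small sketch that prepares requests locally and inspects the Authorization header. The URL and credentials are placeholders, and authorization_header is a helper name made up for this illustration, not part of requests.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

# Placeholder URL; nothing is sent over the network in this sketch.
URL = "https://example.com/form1"


def authorization_header(auth):
    """Return the Authorization header requests would attach, or None."""
    prepared = requests.Request("GET", URL, auth=auth).prepare()
    return prepared.headers.get("Authorization")


# A (username, password) tuple is shorthand for HTTPBasicAuth.
with_tuple = authorization_header(("username", "password"))
with_object = authorization_header(HTTPBasicAuth("username", "password"))

# A request made without auth carries no credentials at all,
# regardless of what any earlier request was given.
without_auth = authorization_header(None)

# Basic auth is just "username:password", base64-encoded.
expected = "Basic " + base64.b64encode(b"username:password").decode("ascii")

print(with_tuple == with_object == expected)
print(without_auth)
```

This is why the requests.get(url) calls inside the loop in the question are likely answered with a 401: the credentials given to the first call are not carried over to later ones.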

You might be saying: "But I don't have to do that in my browser." And that's correct. Most web browsers these days (Firefox and Chrome definitely do, as I can personally attest) will remember HTTP Basic Auth credentials for websites you've visited and automatically send them when asked again by the same site, so you don't see the same prompt over and over. But that's something your web browser does, not something the server does. So when you're making HTTP requests by hand, you're responsible for doing the same.
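One way to get browser-like behaviour is a requests.Session with its auth attribute set, which then applies the credentials to every request made through that session. The following is a minimal sketch under that assumption. The function names make_session and save_urls are invented for this example, and the URLs, credentials, and output folder are placeholders you would replace with your own.

```python
import os

import requests


def make_session(username, password):
    """Create a session that sends Basic Auth credentials on every request."""
    session = requests.Session()
    session.auth = (username, password)
    return session


def save_urls(session, urls, output_dir):
    """Download each URL through the session and save the raw response body."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for url in urls:
        response = session.get(url, timeout=30)
        # Raise on 401/404/etc. instead of silently skipping the file.
        response.raise_for_status()
        file_path = os.path.join(output_dir, os.path.basename(url))
        with open(file_path, "wb") as f:
            f.write(response.content)
        saved.append(file_path)
    return saved


if __name__ == "__main__":
    # Placeholder values, as in the question.
    session = make_session("username", "password")
    save_urls(session, ["url1", "url2", "url3"], "folder on my drive")
```

Calling raise_for_status makes a failed login visible as an exception, whereas the status_code == 200 check in the question skips failed downloads silently. Note that os.path.basename(url) is kept from the question as-is; it may produce an empty or awkward file name for URLs that end in a slash or contain a query string.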

Silvio Mayolo