
I just started learning Python yesterday and have VERY minimal coding skill. I am trying to write a Python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I keep getting "HTTP Error 403: Forbidden" errors when trying to download with the wget module. I believe I'm just not parsing the web links correctly. I think the main issue is that the web links are mostly "s3.amazonaws.com" links that are SUPER long.

For reference:

Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG

Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG

Additionally, people are welcome to weigh in on whether I'm doing this in a stupid way. Each PDF starts with a string of 6 digits, and once I download the supplemental documents I want to auto-save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work well enough to test that, but I'm fairly certain I don't have it correct either.

Help!

#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse

## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"

pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = (1)

    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print (parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*',r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1
    for x in url_list["pdf"]:
        print (parsed_url + "\n")
  • 403 Forbidden means you are not authorized to access the links you are trying to get. If the s3 links are not public, you need to attach auth info. You can try printing a link out before you get it to compare – Patrick Magee Jul 17 '20 at 13:53
  • The links contain authentication codes and access keys, which leads me to believe I can access them. I am able to click through the link from the PDF itself if I do it manually. – jss3000 Jul 17 '20 at 13:55

2 Answers


I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and got the same error (it might be linked to the User-Agent HTTP header that wget sends).

The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).

import requests

r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")

with open("myfile.png", "wb") as file:
    file.write(r.content)
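
As a side note, if you do need to change headers, they are just an extra argument to requests.get. A minimal sketch (the User-Agent value here is an arbitrary example, not something these S3 links are known to require):

import requests

url = "https://s3.amazonaws.com/os_uploads/..."  # the presigned link copied from the PDF

# send an explicit User-Agent header (arbitrary example value)
headers = {"User-Agent": "pdf-attachment-downloader/0.1"}
r = requests.get(url, headers=headers)

with open("myfile.png", "wb") as file:
    file.write(r.content)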

I'm not sure I understand what you're trying to do, but maybe you want to use formatted strings to build your URLs (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format) ?
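
For example, the XXXXXX_attachY naming from the question could be built with str.format; a quick sketch with made-up values:

newname = "123456"        # hypothetical six-digit prefix taken from the PDF's file name
attachment_counter = 1

# builds "123456_attach1.png"; the extension still has to be figured out separately
target = "{}_attach{}.png".format(newname, attachment_counter)
print(target)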

Maybe checking string indexes is fine in your case (if x[0:4] == "http":), but I think you should check out Python's re package and use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html).

import re

# match links that start with http:// or https:// (the question's links are https)
regex = re.compile(r"^https?://")

if re.match(regex, mydocument):  # mydocument is whatever string you are testing
    ...  # do something with the matching link
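
Applied to the question's code, that could look something like this (assuming url_list comes from pdf.get_references_as_dict() as in the question):

# keep only the references that actually look like web links
web_links = [u for u in url_list["url"] if re.match(regex, u)]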
  • Is there a way in the "with open" section of requests to not have to specify a name for the file? Ultimately I can't concatenate the file name the way I want if I have to KNOW the file extension. I feel like an idiot. – jss3000 Jul 17 '20 at 14:52
  • Using requests has allowed me to correctly download a file - now I just need to figure out how to rename the file with the appropriate extension. Thank you. – jss3000 Jul 17 '20 at 15:00
  • That's the second part of my answer. Try to use regular expressions to grab your URLs, throw them in a list. Then:
        for i, url in enumerate(url_list):
            r = requests.get(url)
            with open(f"myfile_{i}.png", "wb") as file:
                file.write(r.content)
    – RomainM Jul 17 '20 at 15:01
  • Don't feel like an idiot, take your time and look for: formatted strings (I think that's what you need to change the name of the file you save every time you download a new file) and regular expressions (to grab the URLs in your PDF files). I linked those in the answer. – RomainM Jul 17 '20 at 15:04

The reason for this behavior is inside the wget library. Internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).

Basically it replaces characters with their appropriate %xx escapes. Your URL is already escaped, but the library does not know that. When it parses the %20 it sees % as a character that needs to be replaced, so the result is %2520 and a different URL - hence the 403 error.
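
You can see the double escaping directly with urllib.parse.quote, which is the same call the question's code makes:

import urllib.parse

already_escaped = "https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG"

# quoting an already-escaped URL escapes the % itself, so %20 becomes %2520
print(urllib.parse.quote(already_escaped, safe='://'))
# -> https://s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG

# unquote reverses one level of escaping
print(urllib.parse.unquote(already_escaped))
# -> https://s3.amazonaws.com/os_uploads/2169504_DFA train pass.PNG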

You could decode that URL first and then pass it, but then you would have another problem with this library, because your URL has the parameter filename*= while the library expects filename=.

I would recommend doing something like this:

import requests

# get the file (parsed_url should be the original, already-escaped link from the PDF)
req = requests.get(parsed_url)

# split the URL's query string to get its GET parameters
get_parameters = parsed_url.split('?')[1].split('&')

filename = ''
# find the GET parameter that carries the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]

# save the file (path is your download folder, e.g. '/users/USERNAME/desktop/worky/')
with open(path + filename, 'wb') as file:
    file.write(req.content)

I would also recommend removing the utf-8'' in that filename because I don't think it is actually part of the filename. You could also use regular expressions for getting the filename, but this was easier for me.
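
For reference, a small sketch of that cleanup, assuming the extracted value looks like the one in the question's link (the leading utf-8'' is a charset prefix, and the value inside the link happens to be percent-escaped twice):

import urllib.parse

# value pulled out of the response-content-disposition parameter in the question's link
raw_name = "utf-8''DFA%2520train%2520pass.PNG"

# drop the charset prefix, then undo the percent escaping (twice here)
name = raw_name.split("''", 1)[-1]
name = urllib.parse.unquote(urllib.parse.unquote(name))
print(name)  # DFA train pass.PNG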