Randomly damaged pdf files when using requests.get() with Python to download pdf

Question

Thank you for reading my post. I have a list of urls for pdf files.

for eachurl in url_list:
    print(eachurl)

Below are the links for my pdfs:

https://www.sec.gov/Archives/edgar/data/1005757/999999999715000035/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1037760/999999999715000162/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1038133/999999999715000169/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1009626/999999999715000483/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1017491/999999999715000518/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000557/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000795/filename1.pdf

These seven links work perfectly if I manually click on them and download the pdf files. However, if I use Python code to download them, errors happen at random. Sometimes the first pdf is damaged and cannot be opened. Sometimes it is the second, or the third, etc.

from pathlib import Path
import requests

# url_list holds the seven pdf links listed above
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169'}
for n_files, eachurl in enumerate(url_list, start=1):
    response = requests.get(eachurl, headers=headers)
    # Save each response body as 1.pdf, 2.pdf, ...
    filename = Path(str(n_files) + '.pdf')
    filename.write_bytes(response.content)

Could you help me understand why this happens?

Update: I uploaded these files to Google Drive and finally found out that the downloads are damaged because the SEC identifies me as a robot, even though I have added the headers. Any idea how to bypass this?

In case of damaged downloads, have you compared that download with the undamaged variant? Or can you share such a damaged download, e.g. by uploading it to Google Drive or GitHub, sharing it publicly, and posting the URL here? — mkl, Aug 24 '21 at 15:06
@mkl Thank you. I have uploaded it to Google Drive, and it seems that the cause is that the SEC detects me as an automated tool. — Jacob Ho, Aug 24 '21 at 15:36

Sujal Singh · Answer 1 · 2021-08-24T15:35:56.153

There is nothing wrong with your code. The website you are downloading the pdf documents from detects that you are using an automated tool, and instead of providing a pdf like it normally would, it returns an HTML page informing you of that.
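One way to confirm this from the script itself is to check whether each response body actually looks like a pdf before saving it. The sketch below is only an illustration: the helper names `looks_like_pdf` and `save_if_pdf` are made up for this example, and it relies on the fact that pdf files begin with the signature `%PDF-`, while the block page is HTML.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature found at the start of a pdf file


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the pdf signature."""
    # Leading whitespace is tolerated; an HTML error page will not match.
    return data.lstrip().startswith(PDF_MAGIC)


def save_if_pdf(data: bytes, path: Path) -> bool:
    """Write data to path only if it looks like a pdf; report what happened."""
    if not looks_like_pdf(data):
        return False
    path.write_bytes(data)
    return True
```

In the download loop this could be called as `save_if_pdf(response.content, filename)`, with a `False` result marking a URL to inspect or retry later.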

Your Request Originates from an Undeclared Automated Tool

To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.

Please declare your traffic by updating your user agent to include company specific information.

For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit sec.gov/developer. You can also sign up for email updates on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact opendata@sec.gov.

For more information, please see the SEC’s Web Site Privacy and Security Policy. Thank you for your interest in the U.S. Securities and Exchange Commission.
Reference ID: 0.2420b07b.1629818487.2ac196c

More Information

Internet Security Policy

By using this site, you are agreeing to security monitoring and auditing. For security purposes, and to ensure that the public service remains available to users, this government computer system employs programs to monitor network traffic to identify unauthorized attempts to upload or change information or to otherwise cause damage, including attempts to deny service to users.

Unauthorized attempts to upload information and/or change information on any portion of this site are strictly prohibited and are subject to prosecution under the Computer Fraud and Abuse Act of 1986 and the National Information Infrastructure Protection Act of 1996 (see Title 18 U.S.C. §§ 1001 and 1030).

To ensure our website performs well for all users, the SEC monitors the frequency of requests for SEC.gov content to ensure automated searches do not impact the ability of others to access SEC.gov content. We reserve the right to block IP addresses that submit excessive requests. Current guidelines limit users to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests.

If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period. Once the rate of requests has dropped below the threshold for 10 minutes, the user may resume accessing content on SEC.gov. This SEC practice is designed to limit excessive automated searches on SEC.gov and is not intended or expected to impact individuals browsing the SEC.gov website.

Note that this policy may change as the SEC manages SEC.gov to ensure that the website performs efficiently and remains available to all users.

Note: We do not offer technical support for developing or debugging scripted downloading processes.
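The notice above quotes a limit of 10 requests per second. Whether the request rate played any part in this particular case is not established in this thread, but a script can stay well under that limit by spacing out its requests. The following is a minimal sketch; the `Throttle` class and the choice of 5 requests per second are assumptions made for illustration, not anything prescribed by the SEC.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second start per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # scheduled start time of the previous call

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A loop would create `throttle = Throttle(5)` once and call `throttle.wait()` immediately before each `requests.get(...)`.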

SOLUTION

Remove the headers; it seems to work fine after that.
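An alternative that follows the notice quoted above ("declare your traffic by updating your user agent to include company specific information") is to send a User-Agent that names who is making the requests instead of imitating a browser. The sketch below is hedged: the helper `declared_headers` is hypothetical, the organization and email values are placeholders, and the exact format the SEC expects is not stated in this thread.

```python
def declared_headers(org: str, contact_email: str) -> dict:
    """Build request headers whose User-Agent names the requester.

    The "organization email" layout is an assumption for illustration.
    """
    org = org.strip()
    contact_email = contact_email.strip()
    if not org or "@" not in contact_email:
        raise ValueError("an organization name and a contact email are required")
    return {"User-Agent": f"{org} {contact_email}"}


# Placeholder values; replace them with your own details.
headers = declared_headers("Example Research Group", "admin@example.com")
```

The resulting dictionary can be passed as `requests.get(eachurl, headers=headers)` in place of the browser-style header from the question.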

Thank you! I just figured this out, too. Do you know how to bypass this? I thought adding headers would help, but it didn't. — Jacob Ho, Aug 24 '21 at 15:34
Thank you for helping me, Sujal. I tried, but it didn't work for me. I guess it is still random? I tried without headers, and some of them (at random) are still forbidden by the SEC website. — Jacob Ho, Aug 24 '21 at 15:42
They mention a request limit per IP address, but I think it has to do with the headers, since no matter how many times I refresh in the browser (cache disabled) it still loads correctly. You could just open Chrome dev tools and copy every request header from there... — Sujal Singh, Aug 24 '21 at 15:47
Thank you so much. I forgot to reply previously. It was because of the headers. I changed the headers and the code worked! — Jacob Ho, Sep 19 '21 at 22:26
I think you should add what you changed in the headers to get it working as an answer. — Sujal Singh, Sep 20 '21 at 04:13
Got it. Thanks. I have updated my post to include the answer. — Jacob Ho, Sep 21 '21 at 15:28
