
I'm trying to automate web scraping of SEC / EDGAR financial reports, but I'm getting HTTP Error 403: Forbidden. I've referred to similar Stack Overflow posts and changed the code accordingly, but no luck so far.

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

Code that I'm working with:

import urllib.request

def get_data(link):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

    # build the request with browser-style headers
    req = urllib.request.Request(link, headers=hdr)

    # fetch the page and decode the response body
    page = urllib.request.urlopen(req, timeout=10)
    content = page.read().decode('utf-8')

    return content

data = get_data(test_URL)

I'm getting the error:

HTTPError                                 Traceback (most recent call last)
             return result

~\Anaconda3n\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

I've also tried using requests.get(test_URL) and then parsing with BeautifulSoup, but that doesn't return the whole text (a rough sketch of that attempt is below). Is there any other approach I could follow?
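This is roughly what that attempt looked like; the parser choice is just whatever I had at hand, so treat it as a sketch rather than the exact code:

import requests
from bs4 import BeautifulSoup

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

# fetch the filing and strip the markup with BeautifulSoup
r = requests.get(test_URL)
soup = BeautifulSoup(r.text, 'html.parser')

# get_text() only keeps what the parser recognises, which seems to be
# why I end up with less than the full filing
print(soup.get_text())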

2 Answers


I had no problems using the requests package. I did need to add a user-agent header; without it, I was getting the same 403 as you. Try this:

import requests 

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

def get_data(link):
    # the user-agent header is what gets past the 403
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}

    req = requests.get(link, headers=hdr)
    content = req.content

    return content

data = get_data(test_URL)
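If you want something a bit more robust, SEC's EDGAR fair-access guidance asks automated clients to send a User-Agent that identifies you with a contact address rather than a browser string. A minimal sketch; the company name and email below are placeholders, so substitute your own:

import requests

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

# SEC asks automated tools to identify themselves; the name and
# email here are placeholders, replace them with your own details
hdr = {'User-Agent': 'Sample Company Name admin@samplecompany.com'}

r = requests.get(test_URL, headers=hdr)
r.raise_for_status()  # raises if we still get a 403 or any other error status
content = r.text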

You don't need to add any headers to this request. Try this:

import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt')
print(r.text)
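If that comes back empty or truncated for you, it's worth checking the status code before trusting the body; something like:

import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt')

# raise_for_status() throws an HTTPError if the server answered
# with 403 (or any other error code) instead of the filing itself
r.raise_for_status()
print(r.text)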