
I'm trying to automate web scraping of SEC / EDGAR financial reports, but I'm getting HTTP Error 403: Forbidden. I've referred to similar Stack Overflow posts and changed the code accordingly, but no luck so far.

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

Code that I'm working with:

import urllib.request

def get_data(link):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

    # build the request with browser-style headers
    req = urllib.request.Request(link, headers=hdr)

    # fetch the page and decode the response body
    page = urllib.request.urlopen(req, timeout=10)
    content = page.read().decode('utf-8')

    return content

data = get_data(test_URL)

I'm getting the error:

HTTPError                                 Traceback (most recent call last)
             return result

~\Anaconda3n\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

I've also tried using requests.get(test_URL) and then parsing with BeautifulSoup, but that doesn't return the whole text (a rough sketch of that attempt is below). Is there any other approach I could follow?
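This is roughly what that attempt looked like; the parser choice is just whatever I had at hand, so treat it as a sketch rather than the exact code:

import requests
from bs4 import BeautifulSoup

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

# fetch the filing and strip the markup with BeautifulSoup
r = requests.get(test_URL)
soup = BeautifulSoup(r.text, 'html.parser')

# get_text() only keeps what the parser recognises, which seems to be
# why I end up with less than the full filing
print(soup.get_text())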

2 Answers


I had no problems using the requests package. I did need to add a user-agent header; without it, I was getting the same 403 as you. Try this:

import requests 

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

def get_data(link):
    # the user-agent header is what gets past the 403
    hdr = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36'}

    req = requests.get(link, headers=hdr)
    content = req.content

    return content

data = get_data(test_URL)
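If you want something a bit more robust, SEC's EDGAR fair-access guidance asks automated clients to send a User-Agent that identifies you with a contact address rather than a browser string. A minimal sketch; the company name and email below are placeholders, so substitute your own:

import requests

test_URL = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'

# SEC asks automated tools to identify themselves; the name and
# email here are placeholders, replace them with your own details
hdr = {'User-Agent': 'Sample Company Name admin@samplecompany.com'}

r = requests.get(test_URL, headers=hdr)
r.raise_for_status()  # raises if we still get a 403 or any other error status
content = r.text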

You don't need to add any headers to this request. Try this:

import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt')
print(r.text)
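If that comes back empty or truncated for you, it's worth checking the status code before trusting the body; something like:

import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt')

# raise_for_status() throws an HTTPError if the server answered
# with 403 (or any other error code) instead of the filing itself
r.raise_for_status()
print(r.text)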