
Super noob here who had a friend help me make this web scraper for looking at hedge fund 13Fs. It was working fine previously, but recently I've been getting this error:

response_two = get_request(sec_url + tags[0]['href'])

IndexError: list index out of range

I don't understand why this index isn't working anymore. I've been trying to figure it out by poking around the browser console while on the SEC site, but I'm having a hard time.

Here is the full code:

import requests
import re
import csv
import lxml
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
sec_url = 'https://www.sec.gov'

def get_request(url):
    return requests.get(url)

def create_url(cik):
    return 'https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany&type=13F-HR'.format(cik)

def get_user_input():
    cik = input("Enter CIK number:")
    return cik

requested_cik = get_user_input()

# Find mutual fund by CIK number on EDGAR
response = get_request(create_url(requested_cik))
soup = BeautifulSoup(response.text, "html.parser")
tags = soup.findAll('a', id="documentsbutton")

# Find latest 13F report for mutual fund
response_two = get_request(sec_url + tags[0]['href'])
soup_two = BeautifulSoup(response_two.text, "html.parser")
tags_two = soup_two.findAll('a', attrs={'href': re.compile('xml')})
xml_url = tags_two[3].get('href')
response_xml = get_request(sec_url + xml_url)
soup_xml = BeautifulSoup(response_xml.content, "lxml")

# DataFrame
df = pd.DataFrame()
df['companies'] = soup_xml.body.findAll(re.compile('nameofissuer'))
df['value'] = soup_xml.body.findAll(re.compile('value'))

for row in df.index:
    df.loc[row, 'value'] = df.loc[row, 'value'].text
    df.loc[row, 'companies'] = df.loc[row, 'companies'].text
df['value'] = df['value'].astype(float)
df = df.groupby('companies').sum()
df = df.sort_values('value',ascending=False)
for row in df.index:
    df.loc[row, 'allocation'] = df.loc[row, 'value']/df['value'].sum()*100
df['allocation'] = df['allocation'].astype(int)
df = df.drop('value', axis=1)
df

Thank you so very much!

  • Could you give a CIK number that you know worked previously with this script? It would help with reproducing the error that you describe. – BrokenBenchmark Dec 30 '21 at 00:42
  • Sure thing, here is one: 1578684 – Ryan Reeves Dec 30 '21 at 01:18
  • Could you also give an example URL for what a 13F report looks like? I've gotten past the issue @HedgeHog described, but I'm not sure about the missing documentbutton element. – BrokenBenchmark Dec 30 '21 at 01:32
  • Thank you very much. Here is a typical link: https://www.sec.gov/cgi-bin/browse-edgar?CIK={1578684}&owner=exclude&action=getcompany&type=13F-HR – Ryan Reeves Dec 30 '21 at 01:54

3 Answers


There are two issues with the script:

  1. The SEC added rate limiting to their website. You aren't alone in facing this issue. To resolve it, use the fix that HedgeHog described.

  2. (Not an actual issue -- see the follow-up comments.) The id of the button you're looking for is "documentsbutton" (with "documents" in the plural), rather than "documentbutton". So you need to change the id of the HTML element that you're looking for.

This:

tags = soup.findAll('a', id="documentbutton")

should be this:

tags = soup.findAll('a', id="documentsbutton")

The errors should be gone! (That being said, I can't verify that the dataframe code will work with these requests, since it is cut off in the original post.)

BrokenBenchmark
  • Thank you very much, but the original code has "documents" as a plural. – Ryan Reeves Dec 30 '21 at 02:14
  • Huh, I must have copied the code wrong. My bad! – BrokenBenchmark Dec 30 '21 at 02:32
  • No worries! Thanks for your help. Unfortunately, even though I'm running the user-agent snippet, I'm still getting the same error. – Ryan Reeves Dec 30 '21 at 04:18
  • Can you run the script with `print(response.status_code)` above the line `response = get_request(create_url(requested_cik))`, and tell me what the output is? – BrokenBenchmark Dec 30 '21 at 04:40
  • The output was 403. But I'm still getting the original error message. – Ryan Reeves Dec 30 '21 at 04:46
  • Weird, I'm getting a 200 (request OK, 403 means Forbidden) on my end. The only thing that would be different at this point would be our IP addresses -- you may want to consider running this script when connected to a different network (e.g. your local library). As much as I'd like to help, it looks like this might be the result of something on the SEC's side, and so there's not much more I can do here. – BrokenBenchmark Dec 30 '21 at 04:56
  • Well thanks so much for your care and concern! I appreciate you! – Ryan Reeves Dec 31 '21 at 05:09
  • No problem! Happy to help :) – BrokenBenchmark Dec 31 '21 at 05:49

Seeing as `tags[0]` raises an IndexError, your problem seems to be that `tags` is an empty list (`[]`).

This means that `soup.findAll` is not finding any `<a>` tags with `id="documentsbutton"` in your soup.

This could be caused by a typo in the URL, in the CIK number, or in the element id you are searching for.

Since I can't access www.sec.gov, I won't try to, which means I can only help with the parts that don't depend on it.
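That said, a quick way to narrow it down (a minimal sketch, reusing the variable and function names from the question's script, with the CIK from the comments) is to check the response status and the tag list before indexing into it:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=1578684&owner=exclude&action=getcompany&type=13F-HR'

response = requests.get(url)
print(response.status_code)  # anything other than 200 means you parsed an error page, not EDGAR

soup = BeautifulSoup(response.text, "html.parser")
tags = soup.findAll('a', id="documentsbutton")
print(len(tags))  # 0 here explains why tags[0] raises IndexError

if not tags:
    # see what actually came back instead of assuming it's the filings page
    print(soup.get_text()[:500])

If the status code is not 200, the problem is the request itself rather than your parsing.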

RossM
  • Thank you. The problem is that I'm nearly certain the URL, CIK and element ID are correct. I've quadruple-checked them. It seems like the index[0] isn't reading the correct thing. For example, here is the link I've been using: https://www.sec.gov/cgi-bin/browse-edgar?CIK={1578684}&owner=exclude&action=getcompany&type=13F-HR – Ryan Reeves Dec 30 '21 at 01:29

What happens?

Always look at your soup first - therein lies the truth. The content can be anything from slightly to completely different from what you see in the browser's dev tools.
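For example, printing the text of the soup (a small sketch, reusing the `response` object from the question's script) shows what the scraper actually received:

soup = BeautifulSoup(response.text, "html.parser")
# print the plain text of whatever page the server really sent back
print(soup.get_text())

In this case it is not the EDGAR results page at all, but the SEC's block page: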

U.S. Securities and Exchange Commission

Your Request Originates from an Undeclared Automated Tool

To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.

Please declare your traffic by updating your user agent to include company specific information.

For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit sec.gov/developer. You can also sign up for email updates on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact opendata@sec.gov.

For more information, please see the SEC's Web Site Privacy and Security Policy. Thank you for your interest in the U.S. Securities and Exchange Commission.

Reference ID: xxxxxxxxxxxxxxxxxxx.xxxxxxxxxx.xxxxxxxx

More Information

Internet Security Policy

By using this site, you are agreeing to security monitoring and auditing. For security purposes, and to ensure that the public service remains available to users, this government computer system employs programs to monitor network traffic to identify unauthorized attempts to upload or change information or to otherwise cause damage, including attempts to deny service to users.

Unauthorized attempts to upload information and/or change information on any portion of this site are strictly prohibited and are subject to prosecution under the Computer Fraud and Abuse Act of 1986 and the National Information Infrastructure Protection Act of 1996 (see Title 18 U.S.C. §§ 1001 and 1030).

To ensure our website performs well for all users, the SEC monitors the frequency of requests for SEC.gov content to ensure automated searches do not impact the ability of others to access SEC.gov content. We reserve the right to block IP addresses that submit excessive requests. Current guidelines limit users to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests.

If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period. Once the rate of requests has dropped below the threshold for 10 minutes, the user may resume accessing content on SEC.gov. This SEC practice is designed to limit excessive automated searches on SEC.gov and is not intended or expected to impact individuals browsing the SEC.gov website.

Note that this policy may change as the SEC manages SEC.gov to ensure that the website performs efficiently and remains available to all users.

How to fix?

You can add a user-agent to your request - but you should respect the website's policies.

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}


def get_request(url):
    # send the user-agent header with every request
    return requests.get(url, headers=headers)
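
Note that the block page quoted above asks you to declare your traffic by putting company-specific information in the user agent, and to stay under 10 requests per second. A sketch that follows both of those requests (the company name and email address are placeholders to replace with your own):

import time
import requests

# declare who you are, as the SEC's policy asks (placeholder contact details)
headers = {'user-agent': 'Sample Company Name admin@samplecompany.com'}

def get_request(url):
    time.sleep(0.2)  # at most ~5 requests per second, well under the 10-per-second limit
    return requests.get(url, headers=headers)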
HedgeHog