I am working with 10-Ks from EDGAR. To help with file management and data analysis, I would like to create a table containing the path to each file, the CIK number of the company that filed it (a unique ID issued by the SEC), and the SIC industry code the company belongs to. Below is an image visually representing what I want to do.

The two things I want to extract are listed at the top of each document. The CIK # will always be a number which is listed after the phrase "CENTRAL INDEX KEY:". The SIC # will always be a number enclosed in brackets after "STANDARD INDUSTRIAL CLASSIFICATION" and then a description of that particular industry.

This is consistent across all filings.
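
A minimal sketch of the two patterns described above, assuming the header lines look like the made-up sample below (the sample text, the pattern names, and the helper `extract_cik_sic` are illustrative, not taken from a real filing or library). Both values are kept as strings so that leading zeros in the CIK are not lost.

```python
import re

# Patterns assume the header layout described above; the sample text below is
# made up for illustration and is not taken from a real filing.
CIK_RE = re.compile(r"CENTRAL INDEX KEY:\s*(\d+)")
SIC_RE = re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:.*\[(\d+)\]")


def extract_cik_sic(header_text):
    """Return (cik, sic) as strings, or None for a field that is not found."""
    cik = CIK_RE.search(header_text)
    sic = SIC_RE.search(header_text)
    return (cik.group(1) if cik else None, sic.group(1) if sic else None)


sample_header = (
    "COMPANY CONFORMED NAME:\t\tEXAMPLE CORP\n"
    "CENTRAL INDEX KEY:\t\t\t0000123456\n"
    "STANDARD INDUSTRIAL CLASSIFICATION:\tSERVICES-PREPACKAGED SOFTWARE [7372]\n"
)
print(extract_cik_sic(sample_header))  # ('0000123456', '7372')
```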

To-dos:

  1. Loop through the files and extract the file path, CIK and SIC numbers, making sure I get exactly one result per document and that the results stay in order, so the records line up across fields.

  2. Merge these fields together. I am guessing the best way to do this is to extract each field into its own list and then merge the lists, maybe into a Pandas dataframe?

Ultimately I will be using this table to help me subset the data between SIC industries.

Thank you for taking a look. Please let me know if I can provide additional documentation.

  • I have made a python library on this https://github.com/rahulrrixe/SEC-Edgar. Hope it helps you out unless your question is too broad. – Rahul Apr 17 '17 at 11:35
  • Thank you Rahul. Fortunately I do have them all downloaded but will check out your library for future downloads. This part is my next step. –  Apr 17 '17 at 12:52

1 Answer

Here is some code I wrote for a similar task. As a first step, walk through the folder to get a list of all the 10-Ks, then run something like the following on each file. It pulls the period of report and the SIC code; the CIK can be extracted the same way. You can write the results out to a CSV file.

    import re

    txtfile = "path/to/10-K.txt"  # placeholder: path to one downloaded filing

    year_end = ""
    sic = ""

    with open(txtfile, 'r', encoding='utf-8', errors='replace') as rawfile:
        for cnt, line in enumerate(rawfile):
            if "CONFORMED PERIOD OF REPORT" in line:
                # The date is the last 8 characters before the newline (YYYYMMDD).
                year_end = line[-9:-1]
            if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
                # Take the first four-digit run on the line as the SIC code.
                match = re.search(r"\d{4}", line)
                if match:
                    sic = match.group(0)
            # Stop once both fields are found, or give up after the header region.
            if (year_end and sic) or cnt > 100:
                break
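
The folder walk and the CSV output could be sketched roughly as follows. This is only an illustration under some assumptions: the filings are `.txt` files somewhere under one root folder, the header fields appear within the first 100 lines, and the names `read_header_fields`, `build_table` and `write_csv` are made up for this example. Building one row per file keeps the path, CIK and SIC aligned, which avoids merging separate lists afterwards.

```python
import csv
import itertools
import os
import re

CIK_RE = re.compile(r"CENTRAL INDEX KEY:\s*(\d+)")
SIC_RE = re.compile(r"STANDARD INDUSTRIAL CLASSIFICATION:.*\[(\d+)\]")


def read_header_fields(path, max_lines=100):
    """Scan at most max_lines lines of one filing for its CIK and SIC."""
    cik, sic = "", ""
    with open(path, "r", encoding="utf-8", errors="replace") as rawfile:
        for line in itertools.islice(rawfile, max_lines):
            if not cik:
                m = CIK_RE.search(line)
                if m:
                    cik = m.group(1)
            if not sic:
                m = SIC_RE.search(line)
                if m:
                    sic = m.group(1)
            if cik and sic:
                break  # both fields found, no need to read further
    return cik, sic


def build_table(root_dir):
    """Walk root_dir and return one row (dict) per .txt file found."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if name.lower().endswith(".txt"):
                path = os.path.join(dirpath, name)
                cik, sic = read_header_fields(path)
                rows.append({"path": path, "cik": cik, "sic": sic})
    return rows


def write_csv(rows, out_path):
    """Write the rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "cik", "sic"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_csv(build_table(root), "index.csv")` with your own root folder would produce the table; the same list of dicts can also be passed to `pandas.DataFrame(rows)` if you prefer a dataframe. Files where a field is not found get an empty string, so they are easy to spot and check by hand.
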
Victor Wang