I am working with 10-Ks from Edgar. To assist in file management and data analysis, I would like to create a table containing the path to each file, the CIK number for the company filed (this is a unique ID issued by SEC), and the SIC industry code which it belongs to. Below is an image visually representing what I want to do.
The two things I want to extract are listed at the top of each document. The CIK # will always be a number which is listed after the phrase "CENTRAL INDEX KEY:". The SIC # will always be a number enclosed in brackets after "STANDARD INDUSTRIAL CLASSIFICATION" and then a description of that particular industry.
This is consistent across all filings.
To do's:
Loop through files: extract file path, CIK and SIC numbers -- with attention that I just get one return per document, and each result is in order, so my records between fields align.
Merge these fields together -- I am guessing the best way to do this is to extract each field into their own separate lists and then merge, maybe into a Pandas dataframe?
Ultimately I will be using this table to help me subset the data between SIC industries.
Thank you for taking a look. Please let me know if I can provide additional documentation.