Currently, I'm working on a project where one PHP script grabs an index file from ftp://ftp.sec.gov and places all the company information into the database. The second PHP script then grabs the raw text file from the SEC and saves it locally for processing.

An example of the raw text file can be found here -

ftp://ftp.sec.gov/edgar/data/2488/0000002488-15-000028.txt

An example of what the final result should be can be found here - http://www.sec.gov/Archives/edgar/data/1084869/000143774915020024/flws20150927_10q.htm

The goal is to present the filing in a formatted way, just as many companies do, but I can't figure out how to do this reliably for every filing. Some filings seem to contain XML, while others contain HTML.

How would I be able to reliably produce the formatted version of the raw text files?

Current code I have -

<?php
$db_hostname = "localhost";
$db_username = "username";
$db_password = "password";
$db_database = "database";

// mysqli replaces the mysql_* functions, which were removed in PHP 7.
$db_server = mysqli_connect($db_hostname, $db_username, $db_password, $db_database);
if (!$db_server) die("Unable to connect to MySQL: " . mysqli_connect_error());

$query = "SELECT * FROM company WHERE company = '1 800 FLOWERS COM INC' AND date = '2015-08-06'";
$result = mysqli_query($db_server, $query);
if (!$result) die("Query failed: " . mysqli_error($db_server));
$row = mysqli_fetch_row($result);
if (!$row) die("No matching company row found\n");

$file = "ftp://ftp.sec.gov/" . $row[4];
$text = file_get_contents($file);
if ($text === false) {
    // This is not inside a loop, so stop here rather than calling continue.
    die("error downloading file $row[4]\n");
}

// Each document in the submission is introduced by a <SEQUENCE> tag.
$tarray = explode('<SEQUENCE>', $text);

$running = '';  // accumulates the HTML of every document in the filing
for ($i = 1; $i < count($tarray); $i++) {
    $a = strstr($tarray[$i], '<HTML>');
    if ($a === false) continue;      // there is no HTML document in this sequence
    $html = strstr($a, '</HTML>', true);
    if ($html === false) continue;   // opening tag without a matching close
    $html .= "</HTML>";

    $running .= $html;
}

$temp = "cache.htm";
file_put_contents($temp, $running);

$name = $row[0] . "-" . $row[3] . ".pdf";
$name = str_replace(' ', '_', $name);
//$content = file_get_contents($row[2] . "-" . $row[1] . ".htm");

// Quote both paths so unusual characters in the names cannot break the command.
exec("D://wkhtmltopdf/bin/wkhtmltopdf.exe " . escapeshellarg($temp) . " " . escapeshellarg($name));

unlink($temp);

//echo($row[0] . " created");

?>

  • Can you present some code? Stack Overflow is about helping you to answer unequivocal questions, not a blog format discussion. – SwiftArchitect Nov 29 '15 at 04:17
  • I have added the code. As it stands right now it processes the parts of a document in HTML, but if there are images or XML elements, it ignores them completely. – Benjamin Schulz Nov 29 '15 at 14:22
  • Thank you for your response and effort. I will remove my comments. I hope your issue gets resolved in a timely manner – SwiftArchitect Nov 30 '15 at 04:07

1 Answer

You don't need to use raw text files. You can use sec-api (https://www.npmjs.com/package/sec-api). The package provides a real-time channel to sec.gov EDGAR using websockets - it works with client-side (React, React Native, Angular, Vue, etc.), and server-side (Node.js, etc.) JavaScript.

As soon as a new filing (10K, 10Q, 13D, etc.) is published on EDGAR, the package fires an event, and returns the following data in JSON:

{
  "companyName":"MORGAN STANLEY (0000895421) (Filer)",
  "type":"424B2",
  "description":"FORM 424B2",
  "linkToFilingDetails":"https://www.sec.gov/...014988-index.htm",
  "linkToHtmlAnnouncement":"https://www.sec.gov/...268.htm",
  "announcedAt":"2018-12-26T16:02:32-05:00"
}

linkToFilingDetails points to the HTML file listing all attachments of the filing. linkToHtmlAnnouncement points to the HTML file of the filing itself.
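
As a rough sketch of how such an event could be consumed from PHP, the snippet below decodes a payload of the shape shown above and pulls out the two links. The payload is hard-coded for illustration (copied from the example, with the elided URLs left as they are), and `parseFilingEvent` is a hypothetical helper name, not part of sec-api.

```php
<?php
// Hypothetical helper (not part of sec-api): decode one filing event and
// return only the fields this sketch cares about.
function parseFilingEvent(string $json): array
{
    $event = json_decode($json, true);
    if (!is_array($event)) {
        throw new InvalidArgumentException('Payload is not valid JSON: ' . json_last_error_msg());
    }

    return [
        'company' => $event['companyName'] ?? null,
        'type'    => $event['type'] ?? null,
        'details' => $event['linkToFilingDetails'] ?? null,   // page listing all attachments
        'html'    => $event['linkToHtmlAnnouncement'] ?? null, // the filing document itself
    ];
}

// Sample payload copied from the example above; the URLs are elided there
// and are left elided here.
$payload = <<<'JSON'
{
  "companyName":"MORGAN STANLEY (0000895421) (Filer)",
  "type":"424B2",
  "description":"FORM 424B2",
  "linkToFilingDetails":"https://www.sec.gov/...014988-index.htm",
  "linkToHtmlAnnouncement":"https://www.sec.gov/...268.htm",
  "announcedAt":"2018-12-26T16:02:32-05:00"
}
JSON;

$filing = parseFilingEvent($payload);
echo $filing['type'], ' ', $filing['html'], "\n";
```

With the link in hand, the formatted filing could then be fetched with file_get_contents and passed to wkhtmltopdf, assuming the URL points at a plain HTML document.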

PHP can also be used, in combination with a WebSocket library (e.g. Ratchet).

I developed the tool. Let me know if you have any feedback and I'm happy to add features.

Jay