How to parse this OFX file?

Question

this is an original ofx file as it comes from m bank (no worries, theres nothing sensitive, i cut out the middle part with all the transactions)

Open Financial Exchange (OFX) is a data-stream format for exchanging financial information that evolved from Microsoft's Open Financial Connectivity (OFC) and Intuit's Open Exchange file formats.

now i need to parse this. i already saw that question, but this is not a dup because i am interested in how to do this.

i am sure i could figure out some clever regexps that would do the job, but that is ugly and error vulnerable (if the format is changed, some fields may be missing, the formatting/white spaces are different etc etc...)

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
    <SIGNONMSGSRSV1>
        <SONRS>
            <STATUS>
                <CODE>0
                <SEVERITY>INFO
            </STATUS>
            <DTSERVER>20110420000000[+1:CET]
            <LANGUAGE>ENG
        </SONRS>
    </SIGNONMSGSRSV1>
    <BANKMSGSRSV1>
        <STMTTRNRS>
            <TRNUID>1
            <STATUS>
                <CODE>0
                <SEVERITY>INFO
            </STATUS>
            <STMTRS>
                <CURDEF>EUR
                <BANKACCTFROM>
                    <BANKID>20404
                    <ACCTID>02608983629
                    <ACCTTYPE>CHECKING
                </BANKACCTFROM>
                    <BANKTRANLIST>
                    <DTSTART>20110207
                    <DTEND>20110419
                    <STMTTRN>
                        <TRNTYPE>XFER
                        <DTPOSTED>20110205000000[+1:CET]
                        <TRNAMT>-6.12
                        <FITID>C74BD430D5FF2521
                        <NAME>unbekannt
                        <MEMO>BILLA DANKT  1265P K2 05.02.UM 17.49 
                    </STMTTRN>
                    <STMTTRN>
                        <TRNTYPE>XFER
                        <DTPOSTED>20110207000000[+1:CET]
                        <TRNAMT>-10.00
                        <FITID>C74BE0F90A657901
                        <NAME>unbekannt
                        <MEMO>AUTOMAT  13177 KARTE2 07.02.UM 10:22 
                    </STMTTRN>
............................. goes on like this ........................
                    <STMTTRN>
                        <TRNTYPE>XFER
                        <DTPOSTED>20110418000000[+1:CET]
                        <TRNAMT>-9.45
                        <FITID>C7A5071492D14D29
                        <NAME>unbekannt
                        <MEMO>HOFER DANKT  0408P K2 18.04.UM 18.47 
                    </STMTTRN>
                </BANKTRANLIST>
                <LEDGERBAL>
                    <BALAMT>1992.29
                    <DTASOF>20110420000000[+1:CET]
                </LEDGERBAL>
            </STMTRS>
        </STMTTRNRS>
    </BANKMSGSRSV1>
</OFX>

i currently use this code which gives me the desired result:

<?

$files = array();
$files[] = '***_2011001.ofx';
$files[] = '***_2011002.ofx';
$files[] = '***_2011003.ofx';

system('touch file.csv && chmod 777 file.csv');
$fp = fopen('file.csv', 'w');

foreach($files as $file) {
    echo $file."...\n";
    $content = file_get_contents($file);

    $content = str_replace("\n","",$content);
    $content = str_replace(" ","",$content);

    $regex = '|<STMTTRN><TRNTYPE>(.+?)<DTPOSTED>(.+?)<TRNAMT>(.+?)<FITID>(.+?)<NAME>(.+?)<MEMO>(.+?)</STMTTRN>|';


    echo preg_match_all($regex,$content,$matches,PREG_SET_ORDER)." matches... \n";


    foreach($matches as $match) {
        echo ".";
        array_shift($match);
        fputcsv($fp, $match);
    }
    echo "\n";
}
echo "done.\n";
fclose($fp);

this is really ugly and if this was a valid xml file i would personally kill myself for that, but how to do it better?

Boy, this format *stinks!* I'm surprised it hasn't been featured on thedailywtf yet. — Pekka, Apr 20 '11 at 15:00
i bet they have internal guidlines at microsoft to make it horrible for external programmers to gain business advantage :D — The Surrican, Apr 20 '11 at 15:16
including internal assessment sheets: how many standards did you violate today? how many external formats did you misuse? how many open software did you steal to sell as our own? — The Surrican, Apr 20 '11 at 15:17
@Pekka: it was on TDWTF already: http://img.thedailywtf.com/images/201110/banksod.png — Marek, Nov 02 '11 at 15:25

Sander Marechal · Accepted Answer · 2011-04-20T15:06:43.217

5

Your code seems fine, considering that the file isn't XML ~~or even SGML~~. The only thing you could do is try to make a more generic SAX-like parser. That is, you simply go through the input stream one block at a time (where block can be anything, e.g. a line or simply a set amount of characters). Then, call a callback function every time you encounter an <ELEMENT>. You can even go as fanciful as building a parser class where you can register callback functions that listen to specific elements.

It will be more generic and less "ugly" (for some definition of "ugly") but it will be more code to maintain. Nice to do and nice to have if you need to parse this file format a lot (or in a lot of different variations). If your posted code is the only place you do this then just KISS.

edited Apr 20 '11 at 15:06

answered Apr 20 '11 at 14:51

Sander Marechal

22,978
13
65
96

1

yeah that was actually my first idea, looping through lines and shift internal pointers according to the tag occourence... i bet they have internal guidlines to make it difficult for outsiders to gain a business advantage :D – The Surrican Apr 20 '11 at 14:54
2

@Joe: actually, I need to correct myself. According to Wikipedia OFX is in fact valid SGML. So, you should be able to use any standard SGML parser to parse these files (probably provided that you have the DTD). – Sander Marechal Apr 20 '11 at 15:04

Chris Jones · Answer 2 · 2014-03-02T13:29:26.913

// Load Data String    
    $str = file_get_contents($fLoc);
    $MArr = array(); // Final assembled master array
// Fetch all transactions
    preg_match_all("/<STMTTRN>(.*)<\/STMTTRN>/msU",$str,$m); 
    if ( !empty($m[1]) ) {
        $recArr = $m[1]; unset($str,$m);
        // Parse each transaction record
        foreach ( $recArr as $i => $str ) {
            $_arr = array();
            preg_match_all("/(^\s*<(?'key'.*)>(?'val'.*)\s*$)/m",$str,$m); 
            foreach ( $m["key"] as $i => $key ) {
                $_arr[$key] = trim($m["val"][$i]); // Reassemble array key => val
            }
            array_push($MArr,$_arr);
        }
    }
    print_r($MArr);

score 0 · Answer 3 · answered Jan 08 '20 at 10:43

function close_tags($x)
{
    return preg_replace('/<([A-Za-z0-9.]+)>([^<\r\n]+)/', '<\1>\2</\1>', $x);
}

$ofx = file_get_contents('myfile.ofx');

$body = '<OFX>'.explode('<OFX>', $ofx)[1];   // strip the header

$xml = close_tags($body);   // make valid XML

$reader = new SimpleXMLElement($xml);

foreach($reader->xpath('//STMTTRN') as $txn):   // find and loop through all STMTTRN tags, note the double forward slash

    // get the tag contents by casting as (string) to invoke the SimpleXMLElement::__toString() method

    $trntype = (string)$txn->TRNTYPE;
    $dtposted = (string)$txn->DTPOSTED;
    $trnamt = (string)$txn->TRNAMT;
    $name = (string)$xn->NAME;
    $memo = (string)$txn->MEMO;

endforeach;

How to parse this OFX file?

3 Answers3

Linked