Parse specifically structured .txt file (C#)

Question

I have a file with a table-like structure. I need to parse it and then map the data to my POCO classes. The file looks like this:

 Financial Institution   : LOREMIPSOM      - 019223
 FX Settlement Date      : 10.02.2021
 Reconciliation File ID  : 801-288881-0005543759-00001
 Transaction Currency    : AZN
 Reconciliation Currency : USD



    +--------------------------------------+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------+---------------------+
    !         Settlement Category          ! Transaction Amount ! Reconciliation Amnt !          Fee Amount !  Transaction Amount ! Reconciliation Amnt !          Fee Amount !   Count !           Net Value !
    !                                      !             Credit !              Credit !              Credit !               Debit !               Debit !               Debit !   Total !                     !
    +--------------------------------------+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------+---------------------+

    ! MC Acq Fin Detail ATM Out                           5.00                   3.57                 49.75                  0.00                  0.00                  0.00        31                  3.32 !
    ! MC Acq Fin Detail Retail Out                        5.40                 262.01                  0.00                  0.00                  0.00                 -3.96        10                258.05 !

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                      Totals :                           10.40                 265.58                 49.75                  0.00                  0.00                 -3.96        41                261.37



 Financial Institution   : LOREMIPSOM      - 019223
 FX Settlement Date      : 10.02.2021
 Reconciliation File ID  : 801-288881-0005543759-00002
 Transaction Currency    : EUR
 Reconciliation Currency : USD



    +--------------------------------------+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------+---------------------+
    !         Settlement Category          ! Transaction Amount ! Reconciliation Amnt !          Fee Amount !  Transaction Amount ! Reconciliation Amnt !          Fee Amount !   Count !           Net Value !
    !                                      !             Credit !              Credit !              Credit !               Debit !               Debit !               Debit !   Total !                     !
    +--------------------------------------+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------+---------------------+

    ! Fee Collection Inc                                  0.00                   0.00                  0.00                  0.00                  0.00                  0.00         0                  0.00 !
    ! Fee Collection Inc                                  0.00                   0.00                  0.00                  8.00                  0.00                  0.00         0                  0.00 !
    ! Fee Collection Inc                                  0.00                   0.00                  0.00                  0.00                  0.00                  0.00         0                  0.00 !
    ! Fee Collection Inc                                  0.00                   0.00                  0.00                 -1.00                  0.00                  0.00         0                  0.00 !

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                      Totals :                            0.00                   0.00                  0.00                  7.00                  0.00                  0.00         0                  0.00

I was thinking of parsing it manually, but I thought there might be a better way. I need every piece of data that is reasonable to parse, that is, everything except the `+` and `-` border symbols. The file structure doesn't change, so the columns are always in the same (fixed) positions. As you can see, it is a bank-related file (transactions). For example, there is the "Financial Institution" header that I map along with the other header data, and a "Settlement Category" of that "Financial Institution" is, for example, "MC Acq Fin Detail ATM Out". What would be the best way to parse the file?

Good question, I've never found a tool for this, have no idea how to setup the tool either... so I've always ended up doing it manually... :( — steb, Jun 29 '22 at 13:59
This will be painful... not impossible, especially if you can guarantee the columns in those tables are always in exactly those positions... but still painful. Whatever system generates this, if you can get something less pretty but that will be more consistent (trade human-readability for machine-readability) you'll thank yourself later. — Joel Coehoorn, Jun 29 '22 at 14:00
Because there simply is no tool to do that... at least not with exactly this structure. — MakePeaceGreatAgain, Jun 29 '22 at 14:00
Arguably the best way is to look up the process generating these files, and get it to generate something more amenable to machine reading. It's not impossible that in fact it is already doing that, but you're only getting this format. — Jeroen Mostert, Jun 29 '22 at 14:00
What did you try already? I hope you don't expect anyone here to write you a parser for this. — MakePeaceGreatAgain, Jun 29 '22 at 14:01
Apart from not showing any attempts of your own, you don't even say which data specifically you're interested in. How should anyone here know what information in this rubbish data is meaningful and thus how it should be interpreted? — MakePeaceGreatAgain, Jun 29 '22 at 14:04
It will take a lot of `IndexOf()` to find specific known-fixed parts and then `Substring()` to get the information you want. Investigate what parts are fixed and where you can find the information you need. — Hans Keﬆing, Jun 29 '22 at 14:06
@MakePeaceGreatAgain Haha , No I don't expect anyone to write a parser. I was thinking parsing it manually, I thought maybe there is a better way.. and about the parsing data so I need every data that is reasonable to parse, so I need to parse the file and get every data except for +- symbols — VahidDev, Jun 29 '22 at 14:07
Also the file structure doesn't change, so the columns are always there (fixed). The file, as you can see, is a bank-related file (transactions). So there is this "Financial Institution", for example, that I map along with other data. "Settlement Category" of this "Financial Institution" is "MC Acq Fin Detail ATM Out", for example — VahidDev, Jun 29 '22 at 14:12
My guess is that "Settlement Category" is a fixed header of a fixed column. You only want the table-content in the column below that (delimited by "----" lines, and possibly empty lines). Same for the other columns. Also "Financial Institution" seems fixed, you want the text after the ":" — Hans Keﬆing, Jun 29 '22 at 14:12
Agree with others - this particular file is human oriented, not machine oriented. Ask for a more appropriate file for automated consumption. If you cannot get that, you're in a real bind, because human formats can be changed at whim to reflect new tastes. Machine oriented formats aren't subject to such whims. — Damien_The_Unbeliever, Jun 29 '22 at 17:23

score 1 · Accepted Answer · answered Jun 29 '22 at 20:55

You can probably do it by parsing one line at a time with regular expressions. Given a known pattern to look for, you pass the current line to `Regex.Match()`, and the returned `Match` exposes every part captured by a parenthesized group through its `Groups` collection. This saves you from chaining complex `IndexOf()` searches along the way.

If `Match.Success` is true, the expected groups are populated and you can pull the pieces out quickly; if not, the line does not fit that pattern. With several patterns defined, you cycle through them until one matches the kind of line you are looking at.

An online regular-expression tester is a good way to try out what you are planning to parse: you can test expressions inline against sample text, and it lets you debug and step through a pattern while describing what each part is looking for. You can paste in the patterns I have in the code to see how they are described, and debug them with some sample text from your file.

Another StackOverflow answer also helped with matching possibly multiple words before the next "marker" that identifies the break to the next part.

Here is something quick I threw together for you. Hopefully it shows you and others a parsing mechanism that avoids the complexity of repeated `IndexOf` searches, substring extraction, re-parsing, and so on. Learning to write patterns can take time, but hopefully I have done enough inline documentation to show that it is not as difficult as one might think.

Good luck.

    // Needs: using System.IO; using System.Text.RegularExpressions; using System.Windows.Forms;
    private void TryRegParse()
    {

        if (!File.Exists("TestingRegex.txt"))
            return;

        // read the text content into already parsed individual lines
        var txtLines = File.ReadAllLines("TestingRegex.txt");

        // Note: ".*?" means "as few as possible of ANY character", not "optional white space";
        // "\s*" is the token for zero or more white-space characters (spaces/tabs), so it is used here.
        var patFinancial = @"^\s*Financial Institution\s*:\s*(?<FinInst>.+?)\s*-\s*(?<FinAccnt>\S+)";
        // Explanation of the pattern
        // ^ = start of the string
        // \s* = zero or more white-space/tab characters
        // Financial Institution = find this exact string
        // \s*: = there may be zero or more white-space/tab characters before the ":" character
        // \s* = another check for zero or more white space after the ":"
        // (?<FinInst>.+?) =
        //  the outer (parens) make the regular expression return the matched portion as a group
        //      ?<FinInst> lets this group be retrieved by the name "FinInst", see shortly
        //      . matches any single character
        //      +? means one or more, but as few as possible (lazy), so the group stops as soon as
        //      the rest of the pattern (white space, then the hyphen) can match.
        //      This lets you capture a name of one or more words up to the first hyphen
        // \s*-\s* = the hyphen separator with optional white space on both sides (kept out of the groups)
        // (?<FinAccnt>\S+) = another named group: the run of non-white-space characters after the hyphen

        // create a regular expression object of just this specific pattern
        var RegExFinInst = new Regex( patFinancial );


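        // Now, prepare another pattern and its regular expression object to match against.
        // For dates, https://regexland.com/regex-dates/ had a good clarification, but since your dates
        // appear in month.day.year format, I had to alter it. The "." separators are escaped as "\."
        // because an unescaped "." matches any character.
        // NOTE: the sample date 10.02.2021 is also a valid day.month.year date; if that is what the file
        // uses, swap the sMonth and sDay groups, or a date such as 29.06.2021 will not match.
        var patFXSettlement = @"^\s*FX Settlement Date\s*:\s*(?<sMonth>0[1-9]|1[0-2])\.(?<sDay>0[1-9]|[12][0-9]|3[01])\.(?<sYear>\d{4})";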
        // each pattern, just creating a regular expression of its corresponding pattern to match
        var RegSettle= new Regex(patFXSettlement);

        // same for the last 2 samples; \S+ keeps the leading white space out of the captured ID
        var patReconFile = @"^\s*Reconciliation File ID\s*:\s*(?<FileId>\S+)";
        var RegRecon= new Regex(patReconFile);

        var patTxnCurr = @"^\s*Transaction Currency\s*:\s*(?<Currency>[A-Z]{3})";
        var RegTxnCurr = new Regex(patTxnCurr);

        // go through each line
        foreach ( var s in txtLines )
        {
            // see if the current line "matches" the Financial Institution pattern.
            // Because the groups are named, you can fetch each one by name without
            // having to know its ordinal position within the expression
            var hasMatch = RegExFinInst.Match(s);
            if( hasMatch.Success )
            {
                MessageBox.Show("Financial Institution Group: " + hasMatch.Groups["FinInst"] + "\r\n"
                            + "Account: " + hasMatch.Groups["FinAccnt"]);
                // done with this line
                continue;
            }

            // if not, try the next, and next and next
            hasMatch = RegSettle.Match(s);
            if( hasMatch.Success )
            {
                MessageBox.Show("FX Settlement Month: " + hasMatch.Groups["sMonth"]
                        + "  Day: " + hasMatch.Groups["sDay"]
                        + " Year: " + hasMatch.Groups["sYear"] );
                // done with this line
                continue;

            }

            hasMatch = RegRecon.Match(s);
            if (hasMatch.Success)
            {
                MessageBox.Show("Reconciliation File: " + hasMatch.Groups["FileId"] );
                // done with this line
                continue;

            }

            hasMatch = RegTxnCurr.Match(s);
            if (hasMatch.Success)
            {
                MessageBox.Show("Transaction Currency: " + hasMatch.Groups["Currency"]);
                // done with this line
                continue;

            }
        }

    }
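
The code above only covers the header lines. The same named-group idea can be stretched to the table's data rows; below is a minimal sketch under assumptions, not a tested parser for the real file. It assumes every data row starts and ends with `!`, holds six two-decimal amounts, an integer count, and a net value, and that numbers use `.` as the decimal separator. `SettlementRow` and `RowParser` are hypothetical names made up for the illustration.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical row type for illustration; not the asker's actual POCO.
public sealed class SettlementRow
{
    public string Category { get; set; }
    public decimal[] Amounts { get; set; }   // the six credit/debit amount columns, left to right
    public int Count { get; set; }
    public decimal NetValue { get; set; }
}

public static class RowParser
{
    // "!" + category (lazy) + six decimals + integer count + net value + "!".
    // The repeated group "Amt" keeps all six matches in its Captures collection.
    private static readonly Regex RowPattern = new Regex(
        @"^\s*!\s*(?<Category>.+?)(?:\s+(?<Amt>-?\d+\.\d\d)){6}\s+(?<Count>\d+)\s+(?<Net>-?\d+\.\d\d)\s*!\s*$");

    // Returns false for anything that is not a data row (borders, column titles, Totals).
    public static bool TryParse(string line, out SettlementRow row)
    {
        row = null;
        var m = RowPattern.Match(line);
        if (!m.Success)
            return false;

        var captures = m.Groups["Amt"].Captures;
        var amounts = new decimal[captures.Count];
        for (int i = 0; i < amounts.Length; i++)
            amounts[i] = decimal.Parse(captures[i].Value, CultureInfo.InvariantCulture);

        row = new SettlementRow
        {
            Category = m.Groups["Category"].Value,
            Amounts = amounts,
            Count = int.Parse(m.Groups["Count"].Value, CultureInfo.InvariantCulture),
            NetValue = decimal.Parse(m.Groups["Net"].Value, CultureInfo.InvariantCulture)
        };
        return true;
    }
}
```

Inside the `foreach` loop you would call `RowParser.TryParse(s, out var row)` after the header patterns have failed. Parsing with `CultureInfo.InvariantCulture` keeps `3.57` from being misread on machines whose culture uses a decimal comma.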

Thank you very much! But I have just now finished parsing it and I used basic splitting by '!', '\r','\n' and " ". Parsed the file and mapped to my poco classes. — VahidDev, Jun 29 '22 at 21:54

VahidDev · Answer 2 · 2022-06-29T22:27:42.343

I ended up parsing it manually. As I said, the structure is always the same, so I use a fail-fast approach: if anything is wrong, I just throw an exception.

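    // `text` is the file content as one string. Helper.AddItem, IsLastIndex,
    // IsSubStrEqualToSpecificStr, and RuntimeServices are my own helpers (not shown here).
    IRuntimeServices runtimeServices = new RuntimeServices();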

    List<string> transactionTitles = new();
    List<string> transactionDetails = new();
    string constText = "Financial Institution";
    bool isTitleFinished = false;
    int counterTable = 0;
    int counterTitle = 0;

    for (int i = 0, j = i; i < text.Length; i++)
    {
        if (text[i] == '+' && !isTitleFinished)
        {
            Helper.AddItem(transactionTitles, text, j, counterTitle);
            isTitleFinished = true;
            j = i;
            counterTitle = 0;
        }
        else if(!isTitleFinished && text[i] != '+')
        {
            counterTitle ++;
        }

        if (isTitleFinished)
        {
            if (text.Length >= i + constText.Length || text.IsLastIndex(i))
            {
                if(text.IsLastIndex(i))
                {
                    Helper.AddItem(transactionDetails, text, j,null);
                }
                else if (text.IsSubStrEqualToSpecificStr(i,constText))
                {
                    Helper.AddItem(transactionDetails, text, j, counterTable);
                    isTitleFinished = false;
                    counterTable = 0;
                    j = i;
                }
                else
                {
                    counterTable++;
                }
            }
        }
    }

    ICollection<Transaction> transactions = new List<Transaction>();

    for (int i = 0; i < transactionTitles.Count; i++)
    {
        string[] titlePairs = transactionTitles[i]
            .Trim()
            .Split(new char[] { '\n', '\r' }, 
            StringSplitOptions.RemoveEmptyEntries); 
        
        Dictionary<string, string> transactionTitlesDict = new ();
        for (int j = 0; j < titlePairs.Length; j++)
        {
            string[] nameAndValue = titlePairs[j].Split(":");
            transactionTitlesDict.Add(nameAndValue[0].Trim(), nameAndValue[1].Trim());
        }
        Transaction transaction = runtimeServices
            .CreateCustomObject<Transaction>(transactionTitlesDict);


        string[] detailPairs = transactionDetails[i]
            .Trim()
            .Split(new char[] { '\n', '\r' },
            StringSplitOptions.RemoveEmptyEntries);

        string[] detailTitlesPart1 = detailPairs[1]
            .Trim()
            .Split(new char[] { '\n', '\r','!' },
            StringSplitOptions.RemoveEmptyEntries);

        string[] detailTitlesPart2 = detailPairs[2]
            .Trim()
            .Split(new char[] { '\n', '\r', '!' },
            StringSplitOptions.RemoveEmptyEntries);

        IList<string> transactionDetailsTitles = new List<string>();

        if(detailTitlesPart1.Length != detailTitlesPart2.Length)
        {
            throw new Exception("Invalid format");
        }

        for (int p = 0; p < detailTitlesPart1.Length; p++)
        {
            transactionDetailsTitles
                .Add($"{detailTitlesPart1[p].Trim()} {detailTitlesPart2[p].Trim()}");
        }

        IList<string[]> transactionDetailsData = new List<string[]>();

        // skip the 4 border/column-title lines at the top and the separator and Totals lines at the bottom
        for (int k = 4; k < detailPairs.Length - 2; k++)
        {
            string[] data = detailPairs[k]
                .Trim()
                .Split(new[] { "  ","!" },
                StringSplitOptions.RemoveEmptyEntries);
            transactionDetailsData.Add(data);
        }

        Dictionary<string, string> transactionDetailsDict = new();

        foreach (string[] transactionDetailDataRow in transactionDetailsData)
        {
            // fail fast once per row, before any of its values are added
            if (transactionDetailDataRow.Length != transactionDetailsTitles.Count)
            {
                throw new Exception("Invalid format");
            }

            for (int l = 0; l < transactionDetailsTitles.Count; l++)
            {
                transactionDetailsDict
                    .Add(transactionDetailsTitles[l].Trim(), transactionDetailDataRow[l].Trim());
            }

            // Don't pay attention to this part
            SettlementDetail settlementDetail = runtimeServices
                .CreateCustomObject<SettlementDetail>(transactionDetailsDict);
            transaction.SettlementDetails.Add(settlementDetail);
            transactionDetailsDict.Clear();
        }
        transactions.Add(transaction);
    }
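
The first loop above walks the text character by character to cut it into a header chunk and a table chunk per block. As a rough illustration only, the same split can be sketched line by line. It assumes each block starts with a line beginning with "Financial Institution" and that its table starts at the first line beginning with `+`; `ReportBlock` and `BlockSplitter` are hypothetical names, not part of the code above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one block of the report (header lines + table lines).
public sealed class ReportBlock
{
    public List<string> HeaderLines { get; } = new List<string>();
    public List<string> TableLines { get; } = new List<string>();
}

public static class BlockSplitter
{
    public static List<ReportBlock> Split(IEnumerable<string> lines)
    {
        var blocks = new List<ReportBlock>();
        ReportBlock current = null;
        bool inTable = false;

        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line.Length == 0)
                continue;                       // blank lines carry no data

            // A new block begins at every "Financial Institution" line.
            if (line.StartsWith("Financial Institution", StringComparison.Ordinal))
            {
                current = new ReportBlock();
                blocks.Add(current);
                inTable = false;
            }

            // Fail fast: content before the first block means the format is not what we expect.
            if (current == null)
                throw new FormatException("Text found before the first 'Financial Institution' line.");

            // The first border line ("+---+---+") ends the header part of the block.
            if (!inTable && line[0] == '+')
                inTable = true;

            if (inTable)
                current.TableLines.Add(line);
            else
                current.HeaderLines.Add(line);
        }
        return blocks;
    }
}
```

With `File.ReadAllLines(path)` as input, each `ReportBlock` then holds the "name : value" header lines and the table lines separately, which is the same shape as the `transactionTitles` and `transactionDetails` lists.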
