I have a lot of TSV files produced by Tesseract OCR. Most of them look like this:

txt
txt
101102,13
dollars
txt
201
202,23
dollars
txt
301

302,33
dollars
txt
txt
txt
401402
dollars
43
cents
txt
501
502
dollars
53
cents
txt
txt
601

602
dollars
63
cents
txt

(This is not a real OCR file. To study the parsing I show only the 11th field. The value txt stands for any text that is not made up only of digits, or of digits and a comma.)

After a lot of effort I found that one simple grammar covers all the cases in my OCR files:

D$         ( 101102,13\ndollars )
DD$        ( 201\n202,23\ndollars )
DND$       ( 301\n\n302,33\ndollars )
D$DC       ( 401402\ndollars\n43\ncents )
DD$DC      ( 501\n502\ndollars\n53\ncents )
DND$DC     ( 601\n\n602\ndollars\n63\ncents )

where the required symbols are: D - digits (only digits, sometimes with a comma), $ - the word "dollars";

the optional symbols are: C - the word "cents", T - text (any non-digit symbols), N - an empty line ("").

So this is the dictionary for the grammar.
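One way to read this grammar as code, as a minimal sketch only: map every value of the 11th field to one letter of the dictionary, join the letters into a string, and let a regular expression over that string play the role of the six rules. The names AmountGrammar, Classify and Extract are hypothetical, and the digit test is an assumption that is slightly looser than the NUM check in my function (it does not require the value to fit into an Int32 or to be greater than zero).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class AmountGrammar
{
    // Map one value of the 11th field to a one-letter grammar symbol.
    public static char Classify(string field)
    {
        if (field == "") return 'N';                                      // empty line
        if (field == "dollars") return '$';
        if (field == "cents") return 'C';
        if (Regex.IsMatch(field, @"^[0-9]+(,[0-9]+)?$")) return 'D';      // digits, optional comma part
        return 'T';                                                       // any other text
    }

    // D, DD or DND, then $, then optionally DC: the six cases in one pattern.
    private static readonly Regex Amount = new Regex(@"(DN?D|D)\$(DC)?");

    public static List<string> Extract(IList<string> fields)
    {
        // One character per field, so an index in the symbol string is also a field index.
        string symbols = new string(fields.Select(f => Classify(f)).ToArray());
        var result = new List<string>();

        foreach (Match m in Amount.Matches(symbols))
        {
            var sb = new StringBuilder();

            // Whole part: concatenate the D fields of group 1, skipping the empty line.
            Group whole = m.Groups[1];
            for (int i = whole.Index; i < whole.Index + whole.Length; i++)
                if (symbols[i] == 'D') sb.Append(fields[i]);

            // Cents part: the D field that opens the optional "DC" group.
            Group cents = m.Groups[2];
            if (cents.Success) sb.Append(',').Append(fields[cents.Index]);

            result.Add(sb.ToString());
        }
        return result;
    }
}
```

In this form the grammar lives in one pattern string, so a new case would mean editing the pattern rather than adding another if. One difference from my rules: a D$ followed by a number that is not followed by "cents" is still reported here as a whole amount.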

This is my parsing function:

void Parsing()
{
    string prepre    = "";
    string pre       = ""; // previous
    string cur       = ""; // current
    string nex       = ""; // next
    string nexnex    = "";
    string nexnexnex = "";

    for (int nline = 2; nline < nlines_-3; nline++)
    {
        prepre       = tsv_array.GetValue(nline - 2, 11).ToString();
        pre          = tsv_array.GetValue(nline - 1, 11).ToString();
        cur          = tsv_array.GetValue(nline, 11).ToString();
        nex          = tsv_array.GetValue(nline + 1, 11).ToString();
        nexnex       = tsv_array.GetValue(nline + 2, 11).ToString();
        nexnexnex    = tsv_array.GetValue(nline + 3, 11).ToString();
        richTextBox1.AppendText( cur + "\n");

//----------- RULES ----------------------------------
        if (!NUM(prepre) & !NUM(pre) & NUM(cur) & DOL(nex) & !NUM(nexnex) & !CEN(nexnexnex))    Numbers.Add(cur);
        if (NUM(pre) & NUM(cur) & DOL(nex) & !NUM(nexnex) & !CEN(nexnexnex))                    { cur = pre + cur; Numbers.Add(cur); }
        if (NUM(prepre) & !NUM(pre) & NUM(cur) & DOL(nex) & !NUM(nexnex) & !CEN(nexnexnex))     { cur = prepre + cur; Numbers.Add(cur); }
        if (!NUM(prepre) & !NUM(pre) & NUM(cur) & DOL(nex) & NUM(nexnex) & CEN(nexnexnex))      { cur = cur + "," + nexnex; Numbers.Add(cur); }
        if (NUM(pre) & NUM(cur) & DOL(nex) & NUM(nexnex) & CEN(nexnexnex))                      { cur = pre + cur + "," + nexnex; Numbers.Add(cur); }
        if (NUM(prepre) & !NUM(pre) & NUM(cur) & DOL(nex) & NUM(nexnex) & CEN(nexnexnex))       { cur = prepre + cur + "," + nexnex; Numbers.Add(cur); }
//----------- RULES ----------------------------------
    } // for 


    bool NUM(string num) // true if the field is a positive integer (commas ignored)
    {
        return Int32.TryParse(num.Replace(",", ""), out int n) && n > 0;
    }
    bool DOL(string num) // true if the field is the word "dollars"
    {
        return num == "dollars";
    }
    bool CEN(string num) // true if the field is the word "cents"
    {
        return num == "cents";
    }
}

This is what I get in the list:

101102,13
201202,23
301302,33
401402,43
501502,53
601602,63

It all works fine, but what if I need to add more rules? For instance, I have more complex data:

txt
701702
(seven
hundred
...
two)
dollars
73
cents
txt
801
802
(eight
hundred
...
two)
dollars
83
cents

txt
901

902
(nine
hundred
one
and
...
two)
dollars
93
cents

OK, I add the following rules:

               if ( !NUM(prepre) & !NUM(pre) & NUM(cur) & BRO(nex) )
                { 
                    for (int nline_ = nline; nline_ < nlines_ - 3; nline_++)
                    {
                        string str = tsv_array.GetValue(nline_, 11).ToString();

                        if (DOL(str))
                        {
                            string cen = tsv_array.GetValue(nline_ + 2, 11).ToString();
                            if (cen == "cents")
                            {
                                nexnex = tsv_array.GetValue(nline_ + 1, 11).ToString();
                                cur = cur + "," + nexnex;
                                Numbers.Add(cur);
                                break;
                            }
                        }
                    }
                }

                if ( !NUM(prepre) & NUM(pre) & NUM(cur) & BRO(nex)) 
                {
                    for (int nline_ = nline; nline_ < nlines_ - 3; nline_++)
                    {
                        string str = tsv_array.GetValue(nline_, 11).ToString();

                        if (DOL(str))
                        {
                            string cen = tsv_array.GetValue(nline_ + 2, 11).ToString();
                            if (cen == "cents")
                            {
                                nexnex = tsv_array.GetValue(nline_ + 1, 11).ToString();
                                cur = pre + cur + "," + nexnex;
                                Numbers.Add(cur);
                                break;
                            }
                        }
                    }
                }

                if (NUM(prepre) & !NUM(pre) & NUM(cur) & BRO(nex))
                {
                    for (int nline_ = nline; nline_ < nlines_ - 3; nline_++)
                    {
                        string str = tsv_array.GetValue(nline_, 11).ToString();

                        if (DOL(str))
                        {
                            string cen = tsv_array.GetValue(nline_ + 2, 11).ToString();
                            if (cen == "cents")
                            {
                                nexnex = tsv_array.GetValue(nline_ + 1, 11).ToString();
                                cur = prepre + cur + "," + nexnex;
                                Numbers.Add(cur);
                                break;
                            }
                        }
                    }
                }
           
bool BRO(string num)
{
   if (num.Contains("(")) return true;
   return false;
}

And it works fine again:

701702,73
801802,83
901902,93

But my code is very complex now.

I hope there is a simpler, universal method, such as a finite-state automaton or a table-driven filter like the ones used in compilers.
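To make clearer what I mean by an automaton, here is a rough sketch of a hand-written state machine over the same field values. It is an illustration under assumptions, not a tested replacement for my rules: the names AmountMachine, State, Classify and Extract are hypothetical, the extra class B (a field containing "(") mirrors my BRO check, and unlike my bracket rules it also emits an amount when no cents follow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AmountMachine
{
    private enum State { Idle, Whole, Gap, Words, Dollars, Cents }

    // Token classes: D digits, $ "dollars", C "cents", N empty, B opens a bracket, T other text.
    private static char Classify(string f)
    {
        if (f == "") return 'N';
        if (f == "dollars") return '$';
        if (f == "cents") return 'C';
        if (Regex.IsMatch(f, @"^[0-9]+(,[0-9]+)?$")) return 'D';
        return f.Contains("(") ? 'B' : 'T';
    }

    public static List<string> Extract(IEnumerable<string> fields)
    {
        var result = new List<string>();
        State state = State.Idle;
        string whole = "", last = "", cents = "";
        int parts = 0; // number of digit fields joined into 'whole' so far

        // A trailing text token flushes an amount that ends the input.
        foreach (string f in fields.Concat(new[] { "<end>" }))
        {
            char s = Classify(f);
            bool again; // true when the same token must be re-read in a new state
            do
            {
                again = false;
                switch (state)
                {
                    case State.Idle:
                        if (s == 'D') { whole = last = f; parts = 1; state = State.Whole; }
                        break;
                    case State.Whole:
                        if (s == 'D') { whole = (parts == 1 ? whole : last) + f; last = f; parts = 2; }
                        else if (s == 'N' && parts == 1) state = State.Gap;
                        else if (s == 'B') state = State.Words;
                        else if (s == '$') state = State.Dollars;
                        else state = State.Idle;
                        break;
                    case State.Gap: // one digit field and one empty line seen
                        if (s == 'D') { whole += f; last = f; parts = 2; state = State.Whole; }
                        else state = State.Idle;
                        break;
                    case State.Words: // inside the amount spelled out in brackets
                        if (s == '$') state = State.Dollars;
                        else if (s == 'D') { whole = last = f; parts = 1; state = State.Whole; }
                        else if (s == 'N' || s == 'C') state = State.Idle;
                        break;
                    case State.Dollars:
                        if (s == 'D') { cents = f; state = State.Cents; }
                        else { result.Add(whole); state = State.Idle; }
                        break;
                    case State.Cents:
                        if (s == 'C') { result.Add(whole + "," + cents); state = State.Idle; }
                        else
                        {
                            // The number after "dollars" was not cents: emit the amount,
                            // then treat that number as the start of a new whole part.
                            result.Add(whole);
                            whole = last = cents; parts = 1; state = State.Whole;
                            again = true;
                        }
                        break;
                }
            } while (again);
        }
        return result;
    }
}
```

With this shape each new case becomes a new state or transition instead of another copy of the scanning loop, which is the kind of simplification I am looking for.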

Added:

I found Coco/R
https://ssw.jku.at/Research/Projects/Coco/Doc/UserManual.pdf
(compiler generator using C#).

Please tell me:

  • Is it suitable for my task?
  • Is it too old? Are there newer programs and methods?
  • If it is suitable, can anybody give me a simple sample of how to convert my grammar into Coco/R input files so that I get the same output as now?
Anri
  • Yes, there are: LR parsers and LL parsers. You can build your own, but it is better to search for a C# parser library. There is also NLP. – bolov Aug 31 '22 at 11:21
  • Look up YACC (Yet Another Compiler Compiler), which is a Unix tool and uses LEX (similar to regex). There are Windows versions of YACC available. – jdweng Aug 31 '22 at 11:42
  • https://stackoverflow.com/questions/540593/lex-yacc-for-c – bolov Aug 31 '22 at 14:40
  • Please look at my updated question again, about Coco/R. I read about YACC in the 70s, and it's too Unix and regex for me. – Anri Sep 01 '22 at 05:12

0 Answers