I have a lot of TSV files produced by Tesseract OCR. Most of them look like this:
txt
txt
101102,13
dollars
txt
201
202,23
dollars
txt
301

302,33
dollars
txt
txt
txt
401402
dollars
43
cents
txt
501
502
dollars
53
cents
txt
txt
601

602
dollars
63
cents
txt
(This is not a real OCR file. To study the parsing I show only field 11. The value txt stands for a few arbitrary characters, anything other than a line consisting only of digits, or of digits and a comma.)
After a lot of effort I found that a simple grammar covers all the cases in my OCR files:
D$ ( 101102,13\ndollars )
DD$ ( 201\n202,23\ndollars )
DND$ ( 301\n\n302,33\ndollars )
D$DC ( 401402\ndollars\n43\ncents )
DD$DC ( 501\n502\ndollars\n53\ncents )
DND$DC ( 601\n\n602\ndollars\n63\ncents )
where the required symbols are:
D - digits: a line of digits only (sometimes with a comma)
$ - dollars: the word "dollars"
and the optional symbols are:
C - cents: the word "cents"
T - text: any line that is not a number
N - "": an empty line (\n)
So this is the vocabulary of the grammar.
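To make the vocabulary concrete, here is a minimal sketch of how each line could be mapped to one symbol. The `Sym` class and the `Classify` name are made up for illustration, and treating any run of digit groups separated by commas as D is my assumption, not something the OCR output guarantees.

```csharp
using System;
using System.Text.RegularExpressions;

static class Sym
{
    // Hypothetical helper: maps one OCR text field to one grammar symbol.
    public static char Classify(string line)
    {
        if (string.IsNullOrEmpty(line)) return 'N';   // empty line
        if (line == "dollars") return '$';
        if (line == "cents") return 'C';
        // Assumption: D is one or more digit groups separated by commas.
        if (Regex.IsMatch(line, @"^[0-9]+(,[0-9]+)*$")) return 'D';
        return 'T';                                   // anything else is text
    }
}
```

With such a helper, the lines txt, 201, 202,23, dollars would become the symbol string TDD$.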
This is my parsing function:
void Parsing()
{
    // Slide a six-line window over field 11 of the TSV array.
    for (int nline = 2; nline < nlines_ - 3; nline++)
    {
        string prepre = tsv_array.GetValue(nline - 2, 11).ToString();
        string pre = tsv_array.GetValue(nline - 1, 11).ToString(); // previous
        string cur = tsv_array.GetValue(nline, 11).ToString(); // current
        string nex = tsv_array.GetValue(nline + 1, 11).ToString(); // next
        string nexnex = tsv_array.GetValue(nline + 2, 11).ToString();
        string nexnexnex = tsv_array.GetValue(nline + 3, 11).ToString();
        richTextBox1.AppendText(cur + "\n");
        //----------- RULES ----------------------------------
        // D$, DD$, DND$: dollars only
        if (!NUM(prepre) && !NUM(pre) && NUM(cur) && DOL(nex) && !NUM(nexnex) && !CEN(nexnexnex)) Numbers.Add(cur);
        if (NUM(pre) && NUM(cur) && DOL(nex) && !NUM(nexnex) && !CEN(nexnexnex)) { cur = pre + cur; Numbers.Add(cur); }
        if (NUM(prepre) && !NUM(pre) && NUM(cur) && DOL(nex) && !NUM(nexnex) && !CEN(nexnexnex)) { cur = prepre + cur; Numbers.Add(cur); }
        // D$DC, DD$DC, DND$DC: dollars followed by cents
        if (!NUM(prepre) && !NUM(pre) && NUM(cur) && DOL(nex) && NUM(nexnex) && CEN(nexnexnex)) { cur = cur + "," + nexnex; Numbers.Add(cur); }
        if (NUM(pre) && NUM(cur) && DOL(nex) && NUM(nexnex) && CEN(nexnexnex)) { cur = pre + cur + "," + nexnex; Numbers.Add(cur); }
        if (NUM(prepre) && !NUM(pre) && NUM(cur) && DOL(nex) && NUM(nexnex) && CEN(nexnexnex)) { cur = prepre + cur + "," + nexnex; Numbers.Add(cur); }
        //----------- RULES ----------------------------------
    } // for

    bool NUM(string num) // true if num is a positive integer (commas ignored)
    {
        return Int32.TryParse(num.Replace(",", ""), out int n) && n > 0;
    }

    bool DOL(string num) // true if num is the word "dollars"
    {
        return num == "dollars";
    }

    bool CEN(string num) // true if num is the word "cents"
    {
        return num == "cents";
    }
}
This is what I get in the Numbers list:
101102,13
201202,23
301302,33
401402,43
501502,53
601602,63
Everything works fine, but what if I need to add more rules? For instance, I have more complex data:
txt
701702
(seven
hundred
...
two)
dollars
73
cents
txt
801
802
(eight
hundred
...
two)
dollars
83
cents
txt
901
902
(nine
hundred
one
and
...
two)
dollars
93
cents
OK, I add the following rules:
if (!NUM(prepre) && !NUM(pre) && NUM(cur) && BRO(nex))
{
    for (int nline_ = nline; nline_ < nlines_ - 3; nline_++)
    {
        string str = tsv_array.GetValue(nline_, 11).ToString();
        if (DOL(str))
        {
            string cen = tsv_array.GetValue(nline_ + 2, 11).ToString();
            if (cen == "cents")
            {
                nexnex = tsv_array.GetValue(nline_ + 1, 11).ToString();
                cur = cur + "," + nexnex;
                Numbers.Add(cur);
                break;
            }
        }
    }
}
if (!NUM(prepre) && NUM(pre) && NUM(cur) && BRO(nex))
{
    for (int nline_ = nline; nline_ < nlines_ - 3; nline_++)
    {
        string str = tsv_array.GetValue(nline_, 11).ToString();
        if (DOL(str))
        {
            string cen = tsv_array.GetValue(nline_ + 2, 11).ToString();
            if (cen == "cents")
            {
                nexnex = tsv_array.GetValue(nline_ + 1, 11).ToString();
                cur = pre + cur + "," + nexnex;
                Numbers.Add(cur);
                break;
            }
        }
    }
}
if (NUM(prepre) && !NUM(pre) && NUM(cur) && BRO(nex))
{
    for (int nline_ = nline; nline_ < nlines_ - 3; nline_++)
    {
        string str = tsv_array.GetValue(nline_, 11).ToString();
        if (DOL(str))
        {
            string cen = tsv_array.GetValue(nline_ + 2, 11).ToString();
            if (cen == "cents")
            {
                nexnex = tsv_array.GetValue(nline_ + 1, 11).ToString();
                cur = prepre + cur + "," + nexnex;
                Numbers.Add(cur);
                break;
            }
        }
    }
}

bool BRO(string num)
{
    return num.Contains("(");
}
And it works fine again:
701702,73
801802,83
901902,93
But my code is now very complex.
I hope there is a simpler, universal method, such as a finite-state automaton or a table-driven filter like the ones used in compilers.
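To show the kind of thing I have in mind, here is a rough sketch, not a tested solution. It assumes every line has already been turned into one symbol character (D, $, C, N, T from my grammar, plus a hypothetical B for a line that starts with "("), so the whole file becomes one symbol string. A single regular expression, which is itself a finite-state automaton, is then matched against that string. The names `AmountMatcher`, `Amount` and `Parse` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AmountMatcher
{
    // Assumed pattern over the symbol string, one character per line:
    // D, optionally [N] D, optional bracketed words (B T*), "$",
    // optional cents (D C).
    static readonly Regex Amount =
        new Regex(@"(?<d1>D)(N?(?<d2>D))?(BT*)?\$((?<c>D)C)?");

    // symbols[i] must be the symbol of lines[i].
    public static List<string> Parse(string symbols, IReadOnlyList<string> lines)
    {
        var numbers = new List<string>();
        foreach (Match m in Amount.Matches(symbols))
        {
            // A group's Index is a position in symbols, hence a line index.
            string number = lines[m.Groups["d1"].Index];
            if (m.Groups["d2"].Success) number += lines[m.Groups["d2"].Index];
            if (m.Groups["c"].Success) number += "," + lines[m.Groups["c"].Index];
            numbers.Add(number);
        }
        return numbers;
    }
}
```

For example, calling `AmountMatcher.Parse("TDD$DC", new[] { "txt", "501", "502", "dollars", "53", "cents" })` should return a single entry, 501502,53. A new case would then mean changing the pattern rather than adding another block of if statements, but I do not know whether this scales to real OCR noise.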
Added:
I found Coco/R
https://ssw.jku.at/Research/Projects/Coco/Doc/UserManual.pdf
(compiler generator using C#).
Please tell me:
- Is it suitable for my task?
- Is it too old? Maybe there are newer tools and methods?
- If it is suitable, can anybody give me a simple example of how to convert my grammar into Coco/R input files so that I get the same output as now?