Tesseract OCR Text Position

Question

I am working on OCR using Tesseract. The application works and produces output. I'm trying to extract data from an invoice, and the spacing between words in the output file has to match the spacing in the input. I am now getting each word and its coordinates; I need to export the words to a text file positioned according to those coordinates.

Code sample:

            using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "eng", EngineMode.Default))
            {
                engine.DefaultPageSegMode = PageSegMode.AutoOsd;
                string path = Server.MapPath("~/Output/data.txt");

                // have to load Pix via a bitmap since Pix doesn't support loading a stream.
                using (var image = new System.Drawing.Bitmap(imageFile.PostedFile.InputStream))
                {
                    // Note: the resized bitmap is not used below; the original image is what gets processed.
                    Bitmap bmp = Resize(image, 1920, 1080);

                    using (var pix = PixConverter.ToPix(image))
                    {
                        using (var page = engine.Process(pix))
                        {
                            using (var iter = page.GetIterator())
                            {
                                iter.Begin();
                                do
                                {
                                    Rect symbolBounds;
                                    if (iter.TryGetBoundingBox(PageIteratorLevel.Word, out symbolBounds))
                                    {
                                        // 'symbolBounds' holds the location of the word, 'curText' the word itself
                                        var curText = iter.GetText(PageIteratorLevel.Word);

                                        //WriteToTextFile(curText, symbolBounds, path);
                                        // Words are appended with no separator, so the original spacing is lost here
                                        resultText.InnerText += curText;
                                    }
                                } while (iter.Next(PageIteratorLevel.Word));
                            }

                            meanConfidenceLabel.InnerText = String.Format("{0:P}", page.GetMeanConfidence());
                        }
                    }
                }
            }

Here is an example of input and output showing the wrong spacing.

Input Output

I have attached my input and output files. The spacing between words in the input has to be similar in the output file. – ab2015, Jul 11 '18 at 12:26
I am building a POC-type project with Tesseract. Could you please tell me which documentation I should refer to in order to do a simple read? – Prashant Pimpale, Jul 12 '18 at 04:40

GWigWam · Accepted Answer · 2018-07-11T13:33:34.533

You can loop through the items found on the page using page.GetIterator(). For each item you can get a bounding box: a Tesseract.Rect (rectangle struct) that contains X1, Y1, X2 and Y2 coordinates.

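// Pick the level you want to iterate at; Word or TextLine are the likely choices.
Tesseract.PageIteratorLevel myLevel = Tesseract.PageIteratorLevel.Word;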
using (var page = Engine.Process(img))
using (var iter = page.GetIterator())
{
    iter.Begin();
    do
    {
        if (iter.TryGetBoundingBox(myLevel, out var rect))
        {
            var curText = iter.GetText(myLevel);
            // Your code here: 'rect' contains the location of the text, 'curText' contains the actual text itself
        }
    } while (iter.Next(myLevel));
}

There is no clear-cut way to use the positions in the input to space the text in the output. You're going to have to write some custom logic for that.
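
One piece of that custom logic is deciding which words belong on the same output line. The following is only a sketch under assumptions: `OcrWord` and `RowGrouping` are hypothetical names, not part of the Tesseract API, and the tolerance rule (half the height of the row's first word) is a guess that may need tuning for your invoices. You would fill `OcrWord` from the `rect` and `curText` values collected in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one OCR result; not part of the Tesseract API.
public record OcrWord(string Text, int X1, int Y1, int X2, int Y2);

public static class RowGrouping
{
    // Groups words into visual rows. A word joins the current row when its
    // vertical centre is within half the height of the row's first word
    // from that word's vertical centre; otherwise it starts a new row.
    public static List<List<OcrWord>> GroupIntoRows(IEnumerable<OcrWord> words)
    {
        var rows = new List<List<OcrWord>>();

        // Process words top to bottom (ordered by vertical centre).
        foreach (var word in words.OrderBy(w => w.Y1 + w.Y2))
        {
            double centre = (word.Y1 + word.Y2) / 2.0;
            var current = rows.LastOrDefault();
            if (current != null)
            {
                var first = current[0];
                double rowCentre = (first.Y1 + first.Y2) / 2.0;
                double tolerance = (first.Y2 - first.Y1) / 2.0;
                if (Math.Abs(centre - rowCentre) <= tolerance)
                {
                    current.Add(word);
                    continue;
                }
            }
            rows.Add(new List<OcrWord> { word });
        }

        // Order each row left to right.
        foreach (var row in rows)
        {
            row.Sort((a, b) => a.X1.CompareTo(b.X1));
        }
        return rows;
    }
}
```

Each resulting row can then be written out as one line of the text file, with the horizontal spacing handled separately.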

You might be able to estimate the number of spaces you need to the left of your text with something like this:

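// inputWidth: width of the input image in pixels; outputWidthSpaces: width of an output line in characters.
// The cast to double avoids integer division, which would otherwise truncate the ratio to 0.
var padLeftSpaces = (int)Math.Round(((double)rect.X1 / inputWidth) * outputWidthSpaces);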
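
To show how that estimate might be applied to a whole line, here is a minimal sketch. It assumes the words of one line are already sorted left to right and that the output is viewed in a monospaced font. `ColumnLayout` and `RenderRow` are hypothetical names introduced for illustration; `inputWidth` and `outputWidthSpaces` have the same meaning as in the formula above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ColumnLayout
{
    // Renders one row of (text, left pixel X) pairs as a single text line.
    // Each word is placed at the character column proportional to its X1,
    // but never closer than one space to the previous word.
    public static string RenderRow(
        IReadOnlyList<(string Text, int X1)> row,
        int inputWidth,
        int outputWidthSpaces)
    {
        var line = new StringBuilder();
        foreach (var (text, x1) in row)
        {
            // Cast to double first: int / int would truncate to 0.
            int column = (int)Math.Round((double)x1 / inputWidth * outputWidthSpaces);

            // Keep at least one space between neighbouring words.
            int minColumn = line.Length == 0 ? 0 : line.Length + 1;
            int target = Math.Max(column, minColumn);

            line.Append(' ', target - line.Length);
            line.Append(text);
        }
        return line.ToString();
    }
}
```

For example, with `inputWidth = 2000` and `outputWidthSpaces = 80`, a word whose `X1` is 1000 starts at column 40. When two words map to overlapping columns, the later one is pushed right, so long lines can drift from their ideal positions; a wider `outputWidthSpaces` reduces that.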

edited Jul 11 '18 at 13:33

answered Jul 11 '18 at 12:26


@ab2015, I have answered your question, I hope you can fix your code yourself since you are more familiar with it. – GWigWam Jul 11 '18 at 12:35
In iter.TryGetBoundingBox(myLevel, out var rect), myLevel is not declared. – ab2015 Jul 11 '18 at 12:36
`myLevel` is a variable of type `Tesseract.PageIteratorLevel`, you must pick one yourself. You probably want to use `PageIteratorLevel.Word` or `PageIteratorLevel.TextLine`. – GWigWam Jul 11 '18 at 12:37
Now I have each word and its coordinates. I need to write the words to a text file according to their coordinates. Can you help me with that? – ab2015 Jul 11 '18 at 13:21
@ab2015 I've updated my answer with some hints. I hope you can implement a complete solution yourself. – GWigWam Jul 11 '18 at 13:43
