I am working on OCR using tesseract. I am able to make the application working and get the output. Here i'm trying to extract data from an invoice bill and getting the extracted data. But the spacing between words in input has to be similar in output file.I am now getting each words and coordinates.I need to export to text file according to coordinates
Code Sample :
using (var engine = new TesseractEngine(Server.MapPath(@"~/tessdata"), "eng", EngineMode.Default))
{
engine.DefaultPageSegMode = PageSegMode.AutoOsd;
// have to load Pix via a bitmap since Pix doesn't support loading a stream.
using (var image = new System.Drawing.Bitmap(imageFile.PostedFile.InputStream))
{
Bitmap bmp = Resize(image, 1920, 1080);
using (var pix = PixConverter.ToPix(image))
{
using (var page = engine.Process(pix))
{
using (var iter = page.GetIterator())
{
iter.Begin();
do
{
Rect symbolBounds;
string path = Server.MapPath("~/Output/data.txt");
if (iter.TryGetBoundingBox(PageIteratorLevel.Word, out symbolBounds))
{
// do whatever you want with bounding box for the symbol
var curText = iter.GetText(PageIteratorLevel.Word);
//WriteToTextFile(curText, symbolBounds, path);
resultText.InnerText += curText;
// Your code here, 'rect' should containt the location of the text, 'curText' contains the actual text itself
}
} while (iter.Next(PageIteratorLevel.Word));
}
meanConfidenceLabel.InnerText = String.Format("{0:P}", page.GetMeanConfidence());
}
}
}
}
Here is an example of input and output showing the wrong spacing.