I am trying to OCR some .tiff files, but Tesseract apparently only scans the first page of each file. Searching Google did not turn up anything helpful. The following code is supposed to return the FULL text of each .tiff file:
    public async Task<List<string>> ScannFile(string file)
    {
        // Only .tiff input is supported, so PDFs are rejected up front.
        if (Path.GetFileName(file).EndsWith(".pdf"))
        {
            MessageBox.Show("You can only scan .tiff documents!");
            return null;
        }
        else
        {
            List<string> PageContent = new();
            // Run the OCR on a background thread so the UI stays responsive.
            await Task.Run(new Action(() =>
            {
                using (var engine = new TesseractEngine(@"C:\Users\f.rigo\source\repos\FinalScanner\FinalScanner\bin\Debug\net5.0-windows/tessdata", "deu", EngineMode.TesseractOnly))
                {
                    // Pix.LoadFromFile returns a single image, so only one page reaches the engine.
                    using (var img = Pix.LoadFromFile(file))
                    {
                        //img.Scale((float)scann_dpi / 2, (float)scann_dpi / 2);
                        using (var page = engine.Process(img))
                        {
                            var text = page.GetText();
                            PageContent = cleanOCROutput(text);
                        }
                    }
                }
            }));
            return PageContent;
        }
    }
I tried to read every page by using a foreach loop, but unfortunately "img" does not expose anything enumerable. By the way, I am using the Tesseract library by Charles Weld.
Do you have any suggestions for how I can scan the 2nd and later pages of .tiff files?
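
One direction that might work, sketched here without having been verified against this setup: the library appears to offer PixArray.LoadMultiPageTiffFromFile, which loads every page of a TIFF into an enumerable collection of Pix objects. The class name MultiPageOcr, the method name ReadAllPages, and its parameters are hypothetical names chosen for illustration, and cleanOCROutput from the code above is left out to keep the sketch self-contained.

```csharp
using System.Collections.Generic;
using Tesseract;

public static class MultiPageOcr
{
    // Hypothetical helper: returns the recognized text of each page, in page order.
    // tessDataPath and language are supplied by the caller (e.g. the tessdata folder and "deu").
    public static List<string> ReadAllPages(string tiffPath, string tessDataPath, string language)
    {
        var pageTexts = new List<string>();

        using (var engine = new TesseractEngine(tessDataPath, language, EngineMode.TesseractOnly))
        // Assumption: PixArray.LoadMultiPageTiffFromFile loads all pages, not just the first.
        using (var pages = PixArray.LoadMultiPageTiffFromFile(tiffPath))
        {
            foreach (var pix in pages)
            {
                // The engine handles one page at a time, so each Page object
                // is disposed before the next call to Process.
                using (var page = engine.Process(pix))
                {
                    pageTexts.Add(page.GetText());
                }
            }
        }

        return pageTexts;
    }
}
```

If this approach holds up, the body of the Task.Run lambda could call something like ReadAllPages(file, tessDataPath, "deu") and pass each returned string through cleanOCROutput.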