C# app not reading txt files after OCR is done, regardless of content

Question

I made an app in C# where you drop a PDF file: it is converted to PNG, the PNG is cropped into several parts, OCR is performed on each part, and the results are written to TXT files. So far so good. The issue appears when I try to read those OCR-generated TXT files: sometimes all of them are read, and sometimes none of them are. This is the code I use to read the files:

var reader = new StreamReader(StoreTextFilePath2);
string direcc = reader.ReadToEnd().ToString();
var reader3 = new StreamReader(StoreTextFilePath3);
string npoliza = reader3.ReadToEnd().ToString();
var reader4 = new StreamReader(StoreTextFilePath4);
string inixo = reader4.ReadToEnd().ToString().Replace("-", "/").Replace(" ", "");
var reader5 = new StreamReader(StoreTextFilePath5);
string finxo = reader5.ReadToEnd().ToString().Replace("-", "/").Replace(" ", "");
var reader6 = new StreamReader(StoreTextFilePath6);
string seccc = reader6.ReadToEnd().ToString();
var reader7 = new StreamReader(StoreTextFilePath7);
string phono = reader7.ReadToEnd().ToString();
var reader8 = new StreamReader(StoreTextFilePath8);
string nyaaa = reader8.ReadToEnd().ToString();
var reader9 = new StreamReader(StoreTextFilePath9);
string dniii = reader9.ReadToEnd().ToString();
var reader10 = new StreamReader(StoreTextFilePath10);
string antep = reader10.ReadToEnd().ToString();

As you can see, each string should hold the contents of its TXT file, but the strings stay empty regardless of what the files contain. Am I doing something wrong? The PNG and TXT files are valid and contain valid text (not the invalid characters that wrong or improper OCR could produce).

Thank you in advance to anyone who can help me.

PS: the "StoreTextFilePath" variables passed to StreamReader are paths to different TXT files, each of which holds the text extracted from one PNG via OCR.

https://stackoverflow.com/questions/2572963/streamreader-readtoend-returning-an-empty-string I think this is a duplicate — Morten Bork, Jun 13 '21 at 15:51
Are you closing/disposing the readers? If you are not, this leaves all these files open. By the way, the PDFs contain text. What is the point of converting to an image and then doing OCR on it? OCR is not very reliable. Try to extract the text from the PDF directly. Based on an array of file paths you could do this in a loop. — Olivier Jacot-Descombes, Jun 13 '21 at 15:51
Does this answer your question? [StreamReader.ReadToEnd() returning an empty string](https://stackoverflow.com/questions/2572963/streamreader-readtoend-returning-an-empty-string) — Ian Kemp, Jun 13 '21 at 15:54
Morten Bork: I also tried ReadLine(), but had the same result. — Gustavo Javier Gonzalez, Jun 13 '21 at 16:24
Olivier: I need to read very specific parts of the PDFs, several times (once per file), and I found OCR directly on the PDF really expensive ($400 for OCR on specific areas is more than I can pay). Besides, I actually found OCR on PNG really good, with about a 1% error rate (I ran 100 OCRs and only one PNG had errors). I also tried loops and had unexpected results. — Gustavo Javier Gonzalez, Jun 13 '21 at 16:28
Ian: I'm gonna try that later, but sometimes the "not reading" issue happens the first time I drop a PDF, so I think it's not related to the problem-and-solution in the link you gave me. Anyways, gonna try later :) thank you all — Gustavo Javier Gonzalez, Jun 13 '21 at 16:30
You don't show the cropping and OCR part. Is it possible that this is an async task that hasn't finished yet when you try to read the text files? — derpirscher, Jun 13 '21 at 16:33
derpirscher: not at all, since sometimes I drop 3 files in a row and all of them are read. It's not an async task: every time I drop the PDF, all OCR is done correctly and the TXT files are saved, but when I try to read the TXT files, nothing is read at all (and not always). — Gustavo Javier Gonzalez, Jun 13 '21 at 21:01
The problem is not caused by the code you are showing here. The reading works fine as it is, if the files are there and properly saved. The problem lies in the process creating the files (or in the synchronization of writing and reading). When you say "sometimes it works, sometimes not", this indicates a timing issue, i.e. the process creating the files hasn't finished yet when the reading process already tries to access them. — derpirscher, Jun 14 '21 at 12:12
derpirscher: in fact, I found out that the problem lies in the crop process. I finally solved the issue with the reading, but now, installed on another PC, the file is cropped only 3 times. It should be 9, but for some reason it stops at 3. I'll try to solve that now. Thank you anyways :) — Gustavo Javier Gonzalez, Jun 15 '21 at 13:20
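
The two suggestions in the comments above (dispose each reader, and loop over an array of paths) could be sketched roughly as follows. This is only an illustration: `OcrTextReader` and `ReadAll` are hypothetical names that do not appear in the question, and the sketch assumes the TXT files have already been fully written when it runs.

```csharp
using System.Collections.Generic;
using System.IO;

public static class OcrTextReader
{
    // Hypothetical helper: reads each file in order and returns one string per path.
    // The using block disposes every StreamReader, so no file handle stays open.
    public static List<string> ReadAll(IEnumerable<string> paths)
    {
        var contents = new List<string>();
        foreach (string path in paths)
        {
            using (var reader = new StreamReader(path))
            {
                contents.Add(reader.ReadToEnd());
            }
        }
        return contents;
    }
}
```

With the variables from the question, a call might look like `var texts = OcrTextReader.ReadAll(new[] { StoreTextFilePath2, StoreTextFilePath3 });`, after which `texts[0]` would take the place of `direcc` and `texts[1]` the place of `npoliza`.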

score 1 · Answer 1 · answered Jun 13 '21 at 16:30

Why don't you try reading the txt files with File.ReadAllText(FilePath);?
It's a lot easier that way.
Also make sure that the txt files aren't empty.
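
A minimal sketch of this suggestion, assuming one OCR field per file. `OcrFieldReader` and `ReadField` are hypothetical names, and the `Trim()` call is an added assumption (OCR output often ends with a line break), not something the question's code does.

```csharp
using System;
using System.IO;

public static class OcrFieldReader
{
    // Hypothetical helper: returns the trimmed text of one OCR output file.
    // File.ReadAllText opens, reads, and closes the file in a single call.
    public static string ReadField(string path)
    {
        string text = File.ReadAllText(path).Trim();
        if (text.Length == 0)
        {
            // The file exists but holds no text (or only whitespace).
            Console.Error.WriteLine($"Warning: '{path}' contains no text.");
        }
        return text;
    }
}
```

For the date fields in the question, the existing replacements can simply be chained on the result, for example `OcrFieldReader.ReadField(StoreTextFilePath4).Replace("-", "/").Replace(" ", "")`.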

CraftingDragon007

And closing and disposing are done automatically. – Olivier Jacot-Descombes Jun 13 '21 at 17:11
I also tried File.ReadAllText(), but had the same result: sometimes no file is read at all. – Gustavo Javier Gonzalez Jun 13 '21 at 20:57
Are you sure that the files aren't empty? – CraftingDragon007 Jun 14 '21 at 04:50
CraftingDragon007: it's not that the files are empty; the files are not created. But that's an issue with the library I'm working with (ImageProcessor), not with the problem raised in this post. Thanks :) – Gustavo Javier Gonzalez Jun 15 '21 at 13:25
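
Since the root cause turned out to be files that were never created, a check before reading could help tell "missing" apart from "empty". The following is only a sketch under that assumption: `OcrOutputCheck` and `FindProblems` are hypothetical names, and it does not address why the crop step skips files.

```csharp
using System.Collections.Generic;
using System.IO;

public static class OcrOutputCheck
{
    // Hypothetical helper: describes every expected file that is missing or zero-length.
    // An empty list means all expected files exist and contain at least one byte.
    public static List<string> FindProblems(IEnumerable<string> expectedPaths)
    {
        var problems = new List<string>();
        foreach (string path in expectedPaths)
        {
            var info = new FileInfo(path);
            if (!info.Exists)
            {
                problems.Add($"missing: {path}");
            }
            else if (info.Length == 0)
            {
                problems.Add($"empty: {path}");
            }
        }
        return problems;
    }
}
```

Running this on the nine expected TXT paths right after the crop and OCR step would show how many files were actually produced before any read is attempted.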
