I have a PDF containing a scanned document, and I need to read certain parts of it. I already had this working with Google Cloud OCR, but I just realized it may not be adequate because I will exceed the monthly quota (1k requests/month), so I'm switching to Tesseract instead.
The project runs on Windows and is written in Java, but for now I'm running some tests on Linux.
I am not uploading any of my original images, as I am not sure whether they contain sensitive information. Instead I am using some images from the internet that are VERY similar.
I have read that I can help Tesseract produce better results by preprocessing the original image first (using TextCleaner?). I would like to know how to do that kind of thing in a Windows/Java environment and, most importantly, how to reliably remove the dark background of the table and, if possible, remove the table's horizontal and vertical lines, as they don't help at all during OCR.
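To make the question more concrete, this is a minimal sketch of the kind of preprocessing I have in mind, using only the JDK (`java.awt.image.BufferedImage`). It is not TextCleaner and I have not tried it on my real scans. It applies a local mean threshold: a pixel is kept as ink only if it is clearly darker than the average of its neighbourhood, so text on a shaded cell should survive while the shading itself turns white. The class name `LocalThreshold`, the method `binarize`, and the `radius` and `offset` parameters are names and values I made up for illustration.

```java
import java.awt.image.BufferedImage;

public class LocalThreshold {

    // Convert the image to a grid of luminance values in 0..255.
    static int[][] toGray(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        int[][] gray = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                gray[y][x] = (299 * r + 587 * g + 114 * b) / 1000;
            }
        }
        return gray;
    }

    // Mark a pixel black when it is more than `offset` levels darker than the
    // mean of the (2*radius+1) square around it; everything else becomes white.
    static BufferedImage binarize(BufferedImage src, int radius, int offset) {
        int w = src.getWidth(), h = src.getHeight();
        int[][] gray = toGray(src);

        // Integral image: sum[y][x] holds the total of all pixels above and left of (x, y).
        long[][] sum = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sum[y + 1][x + 1] = gray[y][x] + sum[y][x + 1] + sum[y + 1][x] - sum[y][x];
            }
        }

        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Clamp the window to the image bounds.
                int x0 = Math.max(0, x - radius), y0 = Math.max(0, y - radius);
                int x1 = Math.min(w, x + radius + 1), y1 = Math.min(h, y + radius + 1);
                long area = (long) (x1 - x0) * (y1 - y0);
                long total = sum[y1][x1] - sum[y0][x1] - sum[y1][x0] + sum[y0][x0];
                // Same as gray < mean - offset, written without division.
                boolean ink = gray[y][x] * area < total - offset * area;
                out.setRGB(x, y, ink ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }
}
```

I would call it as `LocalThreshold.binarize(image, 15, 10)` on an image loaded with `ImageIO.read`, with both numbers being guesses to tune per scan. One limitation I can already see: pixels on the dark side of the border between a shaded cell and a white cell are darker than their local mean, so the edge of the shading can still come out as a black band, and this sketch does nothing about the table lines themselves.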