I have a problem with some PDF files that I need to extract text from. The PDFs are all generated by the same institution. I found a Stack Overflow topic on how to do the mappings manually and tried that, but each file I look at has slight differences in the CIDs/GIDs.

For example:

file1.pdf

file2.pdf

file3.pdf

Is there a way to fix the font somehow, or is OCR the only option?

rainit
  • The mappings seem to indicate that an ad-hoc encoding is used: Starting from some initial value (in your case apparently 7) the first glyph from the font used on the page is given that starting value as code, the next, different glyph is given starting value plus one, the next, different glyph is given starting value plus two, etc. In your case those encodings only have "slight differences" because the documents appear to be generated similarly. To check whether there is some way to make sense out of the encoding, please share representative sample PDFs. – mkl Apr 10 '18 at 07:07
  • https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0 – Tilman Hausherr Apr 10 '18 at 08:30
  • This is exactly what I used. But as the CIDs and GIDs change between files, it is not reliable. Unfortunately the PDFs contain data that can't be shown to the public, so I can't share them. @mkl Is there something I can extract for you? – rainit Apr 10 '18 at 10:09
  • One check you can do yourself: Try copy&paste of critical text (text not properly extracted by PDFBox) from Adobe Reader to an editor of your choice. Is the result correct or not? – mkl Apr 10 '18 at 12:09
  • The result from copy paste are only spaces. – rainit Apr 10 '18 at 13:19
  • @mkl The result from copy/pasting in Adobe reader are only question marks. So I think there's not much I can do? I tried to extract the character codes with PDFBox, thinking I could then map them to CIDs. However, I have been unable to find out what the character codes actually represent, for example: 16980, 2629, 21514, 28938. I read the page contents as an input stream and used font.readCode(is); – rainit Apr 13 '18 at 07:47
  • *"The result from copy paste are only spaces."* / *"The result from copy/pasting in Adobe reader are only question marks."* - Well, it is a bit unclear what exactly the result is, but clearly it does not show the desired characters. Text extraction as done by Adobe Reader copy&paste is already fairly good as far as regular text extraction goes, so if that does not succeed, regular PDFBox text extraction will likely fail, too. Possibly (though not very likely) one could customize PDFBox text extraction by using information from the embedded font, but as you cannot share the PDF, I cannot tell. – mkl Apr 13 '18 at 08:24
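
To illustrate the ad-hoc encoding mkl describes in the first comment, here is a minimal sketch in plain Java (no PDFBox involved). It assumes glyphs can be stood in for by characters and that the starting value is 7, as guessed in that comment; the class and method names are hypothetical. It only shows why a mapping built for one file would not carry over to another, and is not how the PDF producer is known to work.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AdHocEncodingDemo {

    // Gives each distinct character the next free code, in order of first
    // appearance, beginning at startCode (assumed to be 7 in the question).
    static Map<Character, Integer> assignCodes(String text, int startCode) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        int next = startCode;
        for (char c : text.toCharArray()) {
            if (!codes.containsKey(c)) {
                codes.put(c, next++);
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // Two "documents" with different text end up with different mappings.
        System.out.println(assignCodes("Invoice", 7));
        System.out.println(assignCodes("Notice", 7));
    }
}
```

Under these assumptions, 'o' gets code 10 in the first string and code 8 in the second, so a code-to-character table made by hand for one document would probably be wrong for the next, even when both come from the same generator.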

0 Answers