I'm trying to parse a document with Apache Tika, but some character sequences ("ti" and "fb", for example) are replaced with an unknown Unicode symbol. I don't see a way to handle this in Tika itself, since the replacement character seems to come from PDFBox.

I also noticed that the character sequences in question are not part of the GlyphList. Would it be possible to add the sequences and a mapping to the GlyphList to get the expected output? I'm using Tika 1.21 with PDFBox 2.0.15.
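To pin down what the "unknown Unicode symbol" actually is, it may help to list the non-ASCII code points in the extracted text. The following is only a minimal sketch using the Java standard library: `CodePointReport` and `nonAscii` are hypothetical names, not part of Tika or PDFBox, and the sample string is made up to stand in for whatever Tika returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper; not part of Tika or PDFBox.
final class CodePointReport {

    // Counts each non-ASCII code point in the text, keyed as
    // "U+XXXX <name>", in order of first appearance.
    static Map<String, Integer> nonAscii(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> {
                // getName returns null for unassigned code points.
                String name = Character.getName(cp);
                String key = String.format("U+%04X %s", cp, name);
                counts.merge(key, 1, Integer::sum);
            });
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the text extracted by Tika.
        String extracted = "interes\uFFFDng e\uFB00ect";
        nonAscii(extracted).forEach((k, v) -> System.out.println(k + " x" + v));
    }
}
```

If the report shows U+FFFD, the extractor had no usable Unicode mapping for that glyph. If it shows a code point from the Alphabetic Presentation Forms block (U+FB00 to U+FB06), the text contains a regular Unicode ligature instead.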

Fran
  • Those appear to be ligatures – chiliNUT Nov 20 '19 at 20:13
  • https://stackoverflow.com/questions/22348632/handle-ligatures-in-apache-tika – chiliNUT Nov 20 '19 at 20:14
  • Yes, they do appear to be ligatures or some other diglyphs; however, these particular combinations do not get handled by default. Others, such as "ff", seem to be known to PDFBox. – Fran Nov 20 '19 at 20:18
  • What happens if you copy paste from Adobe Reader? – Tilman Hausherr Nov 21 '19 at 04:20
  • Using the Evince reader: the .pdf contains the word "interesting". Searching for "interes" finds it, but searching for "interest" does not. – Fran Nov 21 '19 at 16:31
  • Same with Adobe Acrobat. No "t" is found in the string. – Fran Nov 21 '19 at 16:36
  • Then it is your PDF that is to blame: https://pdfbox.apache.org/2.0/faq.html#text-extraction – Tilman Hausherr Nov 22 '19 at 10:23
  • Ok. So are you confident that there is nothing that can be done to fix or handle the situation by configuring PDFBox or lower-level libraries? – Fran Nov 22 '19 at 16:50
  • Yes. It might be possible to fix that PDF itself. But that would be a lot of work. https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0/39644941 – Tilman Hausherr Nov 25 '19 at 10:34
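For the ligatures that do come through as real Unicode code points (such as the "ff" case mentioned in the comments), one possible post-processing step is Unicode NFKC normalization, which expands compatibility ligatures into plain letters. This is only a sketch under assumptions: `LigatureNormalizer` and its methods are hypothetical names, and the input strings are invented. It cannot recover glyphs that were already extracted as U+FFFD, which matches the conclusion in the comments that the PDF itself lacks the mapping.

```java
import java.text.Normalizer;

// Hypothetical post-processing helper; not part of Tika or PDFBox.
final class LigatureNormalizer {

    // NFKC replaces compatibility characters such as U+FB00 (ff ligature)
    // with their plain-letter equivalents.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // True if the text still contains U+FFFD, i.e. a glyph whose
    // Unicode value could not be determined during extraction.
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Made-up samples standing in for extracted text.
        String fixable = expandLigatures("e\uFB00ect");
        String unfixable = expandLigatures("interes\uFFFDng");
        System.out.println(fixable);
        System.out.println(hasReplacementChar(unfixable));
    }
}
```

Note that NFKC also rewrites other compatibility characters (full-width forms, superscript digits, and so on), so it may change more than the ligatures.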

0 Answers