Undefined characters replacing text ("ti", "fb" for example) in Apache Tika output

Asked Nov 20 '19 at 20:09

Active Nov 20 '19 at 20:20

I'm trying to parse a document using Apache Tika, but the output replaces some character sequences ("ti" and "fb", for example) with an unknown Unicode symbol. I don't see a way to manage this in Tika itself, as the replacement character seems to be coming from PDFBox.

I also noticed that the character sequences in question are not part of the GlyphList. Would it be possible to add the sequences and a mapping to the GlyphList to get the expected output? I'm using Tika 1.21 with PDFBox 2.0.15.

edited Nov 20 '19 at 20:20

Fran

Those appear to be ligatures – chiliNUT Nov 20 '19 at 20:13
https://stackoverflow.com/questions/22348632/handle-ligatures-in-apache-tika – chiliNUT Nov 20 '19 at 20:14
Yes, they do appear to be ligatures or some other diglyphs; however, these particular combinations do not get handled by default. Others, such as "ff", seem to be known to PDFBox. – Fran Nov 20 '19 at 20:18
What happens if you copy paste from Adobe Reader? – Tilman Hausherr Nov 21 '19 at 04:20
Using the evince reader, I have the word "interesting" in the .pdf. I can find "interes". It cannot find "interest". – Fran Nov 21 '19 at 16:31
Same with Adobe Acrobat. No "t" is found in the string. – Fran Nov 21 '19 at 16:36
Then it is your PDF that is to blame: https://pdfbox.apache.org/2.0/faq.html#text-extraction – Tilman Hausherr Nov 22 '19 at 10:23
Ok. So are you confident that there is nothing that can be done to fix or handle the situation by configuring PDFBox or lower-level libraries? – Fran Nov 22 '19 at 16:50
Yes. It might be possible to fix that PDF itself. But that would be a lot of work. https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0/39644941 – Tilman Hausherr Nov 25 '19 at 10:34
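
To illustrate the distinction drawn in the comments, here is a minimal sketch, assuming the extracted text is already available as a Java String. The class and method names (LigatureCheck, expandLigatures, countUnmapped) are hypothetical and are not part of Tika or PDFBox. It uses only java.text.Normalizer from the JDK: ligatures that have their own Unicode code point, such as "ff" (U+FB00), can be expanded with NFKC normalization, whereas a glyph the PDF gives no Unicode mapping for typically shows up as U+FFFD and cannot be recovered by post-processing.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Tika or PDFBox.
final class LigatureCheck {
    // U+FFFD, the Unicode replacement character.
    static final char REPLACEMENT = '\uFFFD';

    // Expands ligature code points (e.g. U+FB00 "ff") into plain letters.
    // Note that NFKC also rewrites other compatibility characters.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Counts replacement characters, i.e. glyphs with no usable mapping.
    static int countUnmapped(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A known ligature is expanded to its component letters.
        System.out.println(expandLigatures("o\uFB00ice"));
        // A replacement character survives normalization and is only counted.
        System.out.println(countUnmapped("interes\uFFFDng"));
    }
}
```

A non-zero count from countUnmapped suggests the information is missing from the PDF itself, which matches the conclusion in the comments that the fix would have to be made in the document rather than in PDFBox configuration.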
