I'm building an application that lets you search inside PDFs using Apache Solr, and I was having trouble finding certain terms in some PDFs.
I noticed that words in adjacent columns sometimes get joined together.
Example:
Column1 | Column2
stack | overflow
Here the PDFTextStripper would sometimes give me "stackoverflow" as the extracted text. This leads to bad tokenization in Solr, which prevents you from finding either term. (Yes, I know I can use wildcards, but those don't work in phrase queries.)
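I have been looking at the PDFBox sources to see what causes the problem. It seems that the writePage method has to guess where the spaces go based on the positions of the characters, and when the gap between two columns is narrow it guesses wrong. I don't think I can realistically change this myself, since the logic is very complex.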
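To show what I mean by "guessing the spaces", here is a minimal sketch in plain Java of that kind of heuristic. This is not PDFBox's actual code: the `Glyph` class, the `layout` helper, the `joinGlyphs` method and the tolerance values are all made up for illustration. A space is only inserted when the horizontal gap between two glyphs exceeds some fraction of the previous glyph's width, so a narrow column gutter gets swallowed.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of gap-based space guessing (hypothetical, not PDFBox code).
class SpaceGuessSketch {

    // Hypothetical glyph: one character with its left edge and width on the page.
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Lays out a word as fixed-width glyphs starting at startX.
    static List<Glyph> layout(String text, double startX, double charWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * charWidth, charWidth));
        }
        return glyphs;
    }

    // Joins glyphs left to right, inserting a space only when the gap to the
    // previous glyph is larger than tolerance * previous glyph width.
    static String joinGlyphs(List<Glyph> glyphs, double tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Column 1 ends at x = 25, column 2 starts at x = 27: a gap of 2 units.
        List<Glyph> row = new ArrayList<>(layout("stack", 0.0, 5.0));
        row.addAll(layout("overflow", 27.0, 5.0));

        System.out.println(joinGlyphs(row, 0.5)); // threshold 2.5: prints "stackoverflow"
        System.out.println(joinGlyphs(row, 0.3)); // threshold 1.5: prints "stack overflow"
    }
}
```

With the looser threshold the two columns collapse into one token, which is exactly the output I'm seeing; a stricter threshold separates them, but in a real document it would presumably also start splitting words that use wide letter spacing.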
Are there any other solutions to get a good text extraction from a pdf with columns?
- Maybe a different conversion program?
- Maybe a patch for PDFBox?
- Yes, I've seen similar questions, but they mostly deal with the order of the extracted text (which in my case doesn't matter that much).