How to avoid PDFBox appending separate words

Question

I'm making an application that allows searching in PDFs using Apache Solr. I was having trouble finding certain terms in PDFs.

I noticed that words in adjacent columns were being joined together.

Example

 Column1 | Column2
 stack   | overflow

Here the PDFTextStripper would sometimes give me stackoverflow as the extracted text. This leads to bad tokenization in Solr, which prevents you from finding the term. (Yes, I know I can use wildcards, but that doesn't work in phrase queries.)

I have been looking at the sources to see what causes the problem. It seems that the writePage method has to guess where the spaces are. I can't really change this, since it seems very complex.

Are there any other solutions to get a good text extraction from a pdf with columns?

Maybe some other conversion program.
Maybe a patch for PDFBox.
Yes, I've seen similar questions, but they mostly deal with the order of the extraction (which in my case doesn't matter that much).

score 0 · Answer 1 · answered Dec 21 '12 at 06:03

I had the same problem while extracting text with PDFBox. I solved it by using the position information of each character: I took the x position and y position of each character and implemented some simple logic to distinguish words. Before that, my only word delimiter was the " " (space). I added one more rule: if the difference between the x positions of two characters is beyond a certain value (this value is your choice) and they are on the same line, that is, they have the same y coordinate, I treat them as belonging to different words. A different y coordinate always means a new word. With this logic I was able to solve the problems with table content, new lines, etc.

This link will help you get the position of characters from a PDF with PDFBox.
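
The rule described above can be sketched roughly as follows. This is only a minimal illustration, not the code actually used: `GapWordSplitter`, `Glyph`, `place`, `maxGap` and `lineTolerance` are hypothetical names, and the character positions are assumed to have been collected already (with PDFBox they could be read from `TextPosition.getX()`, `getY()` and `getWidth()`). The sketch measures the gap from the right edge of the previous character rather than the raw difference in x, which is a small variation on the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups positioned characters into words by horizontal gap. */
class GapWordSplitter {

    /** Hypothetical stand-in for one extracted character and its position. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the character
        final float y;      // vertical position of the line
        final float width;  // horizontal width of the character

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Starts a new word on a space character, on a change of line
     * (y differs by more than lineTolerance), or when the horizontal
     * gap to the previous character exceeds maxGap.
     */
    static List<String> splitWords(List<Glyph> glyphs, float maxGap, float lineTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean newLine = previous != null
                    && Math.abs(g.y - previous.y) > lineTolerance;
            boolean wideGap = previous != null && !newLine
                    && g.x - (previous.x + previous.width) > maxGap;
            if (isSpace || newLine || wideGap) {
                flush(current, words);
            }
            if (!isSpace) {
                current.append(g.text);
            }
            previous = g;
        }
        flush(current, words);
        return words;
    }

    /** Demo helper (assumption): lays out a string as fixed-width glyphs on one line. */
    static List<Glyph> place(String s, float startX, float y, float charWidth) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(s.charAt(i)), startX + i * charWidth, y, charWidth));
        }
        return glyphs;
    }

    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Two table cells on the same line, with no space character between them.
        List<Glyph> glyphs = new ArrayList<>(place("stack", 0f, 100f, 6f));
        glyphs.addAll(place("overflow", 60f, 100f, 6f));
        System.out.println(splitWords(glyphs, 3f, 1f)); // prints [stack, overflow]
    }
}
```

The threshold is document dependent. In this made-up sample the characters are 6 units wide and a gap of more than 3 units counts as a word break; for real PDFs a value relative to the font size or character width is probably a better starting point than a fixed number.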

answered Dec 21 '12 at 06:03

Neeraj

K I'll experiment with this to see if it works. How many px did you take between 2 characters for a new word? – DavidVdd Dec 21 '12 at 08:26
Use the code from http://stackoverflow.com/questions/13948853/pdf-find-out-if-text-is-underlined-or-a-table-cell – Neeraj Dec 21 '12 at 08:43
I think the word delimiter is already being estimated in PDFBox 1.7.1, not sure though. – DavidVdd Dec 21 '12 at 08:50
I created my own character, word, line and page objects for my project. That is what I was talking about. – Neeraj Dec 21 '12 at 08:52
For your case, just create one object named word. Parse your PDF file character by character. When a word boundary is encountered (as per your logic), store that string in the word object and continue the process. – Neeraj Dec 21 '12 at 08:55
http://stackoverflow.com/questions/13948853/pdf-find-out-if-text-is-underlined-or-a-table-cell I think you can use this code. From the object t you will get position information also – Neeraj Dec 21 '12 at 08:57