
I'm building an application that allows searching in PDFs using Apache Solr. I was having trouble finding certain terms in the PDFs.

I noticed words in columns got appended.

Example

 Column1 | Column2
 stack   | overflow

Here the PDFTextStripper would sometimes give me "stackoverflow" as the extracted text. This leads to bad tokenization in Solr, which prevents you from finding the term. (Yes, I know I can use wildcards, but those don't work in phrase queries.)

I have been looking at the sources to see what causes the problem, and it seems that the writePage method has to guess where the spaces are. I can't really change this, since the logic seems very complex.

Are there any other solutions to get a good text extraction from a pdf with columns?

  • Maybe some other conversion program.
  • Maybe a patch for PDFBox.
  • Yes, I've seen similar questions, but they mostly deal with the order of the extraction (which in my case doesn't matter that much).
BenMorel
DavidVdd

1 Answer


I had the same problem when extracting text with PDFBox. I solved it by using the position information of each character: I took the x and y position of every character and implemented some simple logic to separate words. Before that, my only word delimiter was the space character (" "). I added one more rule: if two characters are on the same line (same y coordinate) and the difference between their x positions exceeds a certain threshold (the value is up to you), I treat the second character as the start of a new word. A different y coordinate always means a new word. With this logic I was able to solve the problems with table content, line breaks, etc.

This link will help you to get the position of characters from pdf with PDFbox.
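The gap rule described above could be sketched roughly as follows. This is only an illustration, not the code I actually used: `Glyph` and `GapWordSplitter` are hypothetical names and not PDFBox types, the thresholds are arbitrary, and the gap is measured from the right edge of the previous character rather than between the two raw x positions. With PDFBox you would fill the `Glyph` objects from whatever per-character positions you extract.

```java
import java.util.ArrayList;
import java.util.List;

class GapWordSplitter {

    /** Hypothetical stand-in for one positioned character (not a PDFBox class). */
    static final class Glyph {
        final String text;
        final double x;      // left edge of the character
        final double y;      // vertical position of the line
        final double width;  // advance width of the character

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Splits glyphs (given in reading order) into words. A new word starts at
     * a whitespace glyph, at a change of line (y differs by more than
     * lineTolerance), or when the horizontal gap to the previous glyph
     * exceeds maxGap.
     */
    static List<String> splitWords(List<Glyph> glyphs, double maxGap, double lineTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;

        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean boundary = false;

            if (prev != null) {
                boolean newLine = Math.abs(g.y - prev.y) > lineTolerance;
                double gap = g.x - (prev.x + prev.width);
                boundary = newLine || gap > maxGap;
            }

            // Close the word collected so far when a boundary or a space is hit.
            if ((boundary || isSpace) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (!isSpace) {
                current.append(g.text);
            }
            prev = g;
        }

        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

A reasonable starting value for `maxGap` is probably some fraction of the average character width on the page, but you will have to tune it against your own documents.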

Neeraj
  • OK, I'll experiment with this to see if it works. How many px did you allow between two characters before starting a new word? – DavidVdd Dec 21 '12 at 08:26
  • Use the code from http://stackoverflow.com/questions/13948853/pdf-find-out-if-text-is-underlined-or-a-table-cell – Neeraj Dec 21 '12 at 08:43
  • I think the word delimiter is already being estimated in PDFBox 1.7.1, not sure though. – DavidVdd Dec 21 '12 at 08:50
  • I created my own character, word, line and page objects for the purposes of my project. That is what I was talking about. – Neeraj Dec 21 '12 at 08:52
  • For your case, just create one object named word. Parse your PDF file character by character. When you hit a word boundary (as per your logic), store that string in the word object and continue the process. – Neeraj Dec 21 '12 at 08:55
  • http://stackoverflow.com/questions/13948853/pdf-find-out-if-text-is-underlined-or-a-table-cell I think you can use this code. From the object t you will also get the position information. – Neeraj Dec 21 '12 at 08:57