I want to write a function like this:

Input: a PDF file and a string. The PDF is searchable (for example, it was created by MS Word).
Output: the page and position (x and y coordinates) of the string in the PDF file, if it is found.

Could you give me some hints (which library, which approach, ...) for doing this in Python?

Thank you very much

mommomonthewind
  • check this question: http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text – Elisha Jun 25 '14 at 14:39
  • Thank you very much for the reference, but I am afraid it is not exactly what I want. I do not want to extract text from a PDF; I want to find the position of text in a PDF. – mommomonthewind Jun 25 '14 at 18:17

1 Answer

It helps to read sections 7.7 (Document Structure) and 9 (Text) of the PDF specification to get at least a rough idea of how text is stored in a PDF.

Approach:

Traverse every page through the Page Tree, which contains the Page Objects, and look up each page's Contents entry. That entry holds the page's content stream: the page elements described by PDF content stream operators, whose syntax resembles PostScript.


Example:

The text ABC is placed 10 inches from the bottom of the page and 4 inches from the left edge, using 12-point Helvetica.

BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
ET 
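
To make the numbers in this example concrete, here is a minimal sketch that pulls the operands of the first Td operator out of the stream and converts them to inches. It assumes the default user space unit of 1/72 inch and an otherwise untransformed page; the helper name `first_td_offset` is made up for illustration.

```python
import re

POINTS_PER_INCH = 72  # default PDF user space: 1 unit = 1/72 inch

content = """BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
ET"""

def first_td_offset(stream):
    """Return the (x, y) operands of the first Td operator, in points."""
    number = r"(-?\d+(?:\.\d+)?)"
    match = re.search(number + r"\s+" + number + r"\s+Td\b", stream)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

x_pt, y_pt = first_td_offset(content)
x_in = x_pt / POINTS_PER_INCH  # distance from the left edge
y_in = y_pt / POINTS_PER_INCH  # distance from the bottom edge
print(x_in, y_in)  # 4.0 10.0
```

PDF coordinates start at the bottom-left corner of the page, which is why 720 points means 10 inches from the bottom.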

Strings inside can be represented as:

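Literal string (7.3.4.2) - this is pretty much straight-forward, as you just walk the data for "(.*?)". In a real regular expression the parentheses must be escaped, and balanced or backslash-escaped parentheses inside the string need extra handling.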

Hexadecimal string (7.3.4.3) - this one is trickier, because we have to decode the data before we can compare it with the string we are searching for.
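
As a rough sketch of the decoding step, the two helpers below turn both string forms into raw bytes. The function names are hypothetical, and the result is only a sequence of character codes: with a custom font encoding those codes still have to be mapped to text before they can be compared with the search string.

```python
import re

# Single-character escapes allowed inside a literal string.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
            "(": "(", ")": ")", "\\": "\\"}

def decode_hex_string(token):
    """Decode a hexadecimal string such as '<414243>' into bytes."""
    digits = "".join(token.strip()[1:-1].split())  # drop <> and whitespace
    if len(digits) % 2:  # odd number of digits: a trailing 0 is implied
        digits += "0"
    return bytes.fromhex(digits)

def decode_literal_string(token):
    """Decode a literal string such as '(A\\(B\\))' into bytes."""
    def unescape(match):
        esc = match.group(1)
        if esc[0] in "01234567":          # octal character code, e.g. \103
            return chr(int(esc, 8) & 0xFF)
        if esc in ("\n", "\r", "\r\n"):   # backslash + newline: line continuation
            return ""
        return _ESCAPES.get(esc, esc)     # unknown escape: backslash is ignored
    body = token[1:-1]                    # drop the outer parentheses
    text = re.sub(r"\\([0-7]{1,3}|\r\n|.)", unescape, body, flags=re.S)
    return text.encode("latin-1")         # assumes the token was read as latin-1

print(decode_hex_string("<414243>"))            # b'ABC'
print(decode_literal_string(r"(A\(B\)\103)"))   # b'A(B)C'
```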

Once the string is matched, the last thing remaining is to figure out its position. This basically requires interpreting the text-positioning operators (such as Td) in the content stream.
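
The idea can be sketched with a tiny interpreter that walks a decoded content stream, tracks the text line matrix, and reports where a matching Tj string starts. This is a simplified illustration, not a full implementation: `TOKEN`, `decode` and `find_text` are made-up names, only BT, Td, TD, Tm and Tj are handled, and it ignores the cm operator, TJ arrays, glyph widths and font encodings, so the result is the start of the text line rather than the exact glyph position.

```python
import re

TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)        # literal string without nested parentheses
  | <[0-9A-Fa-f\s]*>            # hexadecimal string
  | /[^\s/()<>\[\]]*            # name, e.g. /F13
  | [+-]?(?:\d+\.?\d*|\.\d+)    # number
  | [A-Za-z*]+                  # operator, e.g. BT, Td, Tj
""", re.X | re.S)

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

def decode(tok):
    """Crude string decoding: hex digits, or a literal with backslashes stripped."""
    if tok.startswith("<"):
        digits = "".join(tok[1:-1].split())
        return bytes.fromhex(digits + "0" * (len(digits) % 2)).decode("latin-1")
    # Only \( \) and \\ come out right here; other escapes are not handled.
    return re.sub(r"\\(.)", r"\1", tok[1:-1], flags=re.S)

def find_text(stream, needle):
    """Return (x, y) line origins, in points, of Tj strings containing needle."""
    operands, hits = [], []
    line = IDENTITY                      # text line matrix (a, b, c, d, e, f)
    for match in TOKEN.finditer(stream):
        tok = match.group(0)
        if tok[0] in "(</+-.0123456789":  # operand: collect until an operator
            operands.append(tok)
            continue
        if tok == "BT":                  # a text object starts with identity
            line = IDENTITY
        elif tok in ("Td", "TD"):        # move to the start of the next line
            tx, ty = float(operands[-2]), float(operands[-1])
            a, b, c, d, e, f = line
            line = (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)
        elif tok == "Tm":                # set the matrix explicitly
            line = tuple(float(v) for v in operands[-6:])
        elif tok == "Tj" and needle in decode(operands[-1]):
            hits.append((line[4], line[5]))
        operands.clear()
    return hits

stream = "BT /F13 12 Tf 288 720 Td (ABC) Tj ET"
print(find_text(stream, "ABC"))  # [(288.0, 720.0)]
```

To get a page number as well, the same function would be run on the content stream of each page in turn, with the stream decompressed first by whatever PDF library is used.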

Most of the things I have mentioned are already implemented in many products (iText, Ghostscript, ...), whose source you can read as a reference implementation.

I personally do not have any experience with Python-based PDF libraries, so you will have to figure that part out on your own.

Gotcha
  • *Literal string (7.3.4.2) - this is pretty much straight-forward, as you just walk the data for "(.*?)"* - That's only true for simple examples using standard font encoding. Meanwhile custom encodings for embedded fonts have become very common. – mkl Jul 03 '14 at 12:36