According to this site http://www.searchable-pdf.com/content.php?lang=en&c=61, a PDF can be searchable when a text layer is added.

I was looking for the technical specification of PDF. I think text can be stored in a PDF in two ways: a) as a text layer above the image layer (as described on the webpage above); b) when you create a PDF from a Word document (with text), I don't think Word will store all the text in the text layer. I think it will store it in the image layer. Is that right?

Since PDF 1.4, XMP has been added (http://en.wikipedia.org/wiki/Extensible_Metadata_Platform). But what is XMP? Is this the "text layer" which I discussed above?

If a scanner performs OCR on an image, does it store the text in the "text layer" or in the "XMP" field? Is this only possible when the PDF is version 1.4 or later?

And how can I detect if a PDF already has text data? For example: PDF A has been scanned with OCR and PDF B has not. How can I know that PDF B should be sent to a separate OCR engine?

Jochen Hebbrecht
  • Usually, after OCR the text is added in 'invisible' text rendering mode to the *normal* content of the PDF (not to an extra *layer* that is made invisible, which is also technically possible in PDF; look for *Optional Content* in the PDF specification). However, in real-world PDFs (both 'scanned' and 'normal' ones), you'll often find that you can select the text and copy it, but after pasting you only get gobbledygook. The same happens if you use `pdftotext` on such a file. If so, then it's a problem with the *encoding* of the font used. – Kurt Pfeifle Jul 10 '12 at 17:51

2 Answers

The PDF specification makes no mention of a 'text layer'. Normally, there is just one way to 'store' text: by means of text-showing operators. These operators draw text at a specific location, using a specific color, font, font size and text rendering mode. There are several text rendering modes, but for the purpose of answering your question it is enough to say that text can be visible or invisible.

A scanner that performs OCR renders both the raster image and the text to the PDF document. The text is rendered using the invisible text rendering mode. The result is that you can select the text with the mouse (the highlighted area is shown at the expected location on top of the image) and you can search for text. Again, the search result is shown at the correct location.
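To make this concrete, here is a minimal Python sketch of the content-stream operators involved. The helper names (`escape_pdf_string`, `text_ops`) and the font resource name `F1` are made up for illustration, and the sketch assumes plain ASCII text in a font with a simple encoding. It only builds the operator text; it does not write a complete PDF file.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_ops(text: str, x: float, y: float, *, font: str = "F1",
             size: float = 12, render_mode: int = 0) -> str:
    """Return content-stream operators that show `text` at (x, y).

    render_mode 0 fills the glyphs (visible text); render_mode 3 neither
    fills nor strokes them (invisible text, as OCR tools typically emit).
    `font` is a hypothetical resource name that the page would have to define.
    """
    return "\n".join([
        "BT",                                  # begin text object
        f"/{font} {size} Tf",                  # select font and size
        f"{render_mode} Tr",                   # set text rendering mode
        f"{x} {y} Td",                         # move to the text position
        f"({escape_pdf_string(text)}) Tj",     # show the text
        "ET",                                  # end text object
    ])


# Invisible text that would sit on top of a scanned image:
print(text_ops("Hello (OCR)", 72, 720, render_mode=3))
```

The only difference between 'visible' and 'invisible' text here is the operand of `Tr`; everything else is the same text-showing machinery.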

What happens when you generate PDF from a Word document depends on the software that you use to convert. To my knowledge, these converters do not generate an image but they will generate visible text.

XMP is metadata, as opposed to visual data.

Finally, with respect to your question about detecting whether a PDF has text data, here is a similar question (10k only).
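As a rough illustration of one way to check this yourself, the sketch below scans a PDF's streams for the text-showing operators `Tj` and `TJ`. It is a crude heuristic under several assumptions, not a real parser: it only handles uncompressed or Flate-compressed streams, ignores encryption and other filters, and binary image data could in principle produce a false positive. The function names are made up for this example. A proper PDF library is the safer choice in production.

```python
import re
import zlib

# Naive match for the bytes between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# A string or array operand followed by Tj or TJ (the ' and " operators are ignored).
TEXT_OP_RE = re.compile(rb"[)\]>]\s*T[jJ]\b")


def iter_streams(pdf_bytes: bytes):
    """Yield each stream body, inflated when it looks like zlib data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates a slightly truncated tail.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate data; inspect the bytes as they are


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to contain Tj or TJ."""
    return any(TEXT_OP_RE.search(body) for body in iter_streams(pdf_bytes))
```

With something like `has_text_operators(open("b.pdf", "rb").read())` returning `False`, PDF B would be a candidate for the separate OCR engine. A `True` result only says that text operators exist, not that the extracted text is usable.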

Ooker
Frank Rem
  • Some other questions I have: * can each version (http://en.wikipedia.org/wiki/Portable_Document_Format#Adobe.27s_versions) of PDF contain text? Is there a specification in the format that tells you how to store text?
    * if you have a PDF which has been OCR'd, but you "re-OCR" it again using another OCR engine, what will happen with the previous OCR text?
    – Jochen Hebbrecht Jul 10 '12 at 17:45
  • @JochenHebbrecht: Look at my answer. It also provides a link to the spec. **Of course** there are exact rules in the spec about how to store texts (but you'll not find them in Wikipedia). – Kurt Pfeifle Jul 10 '12 at 19:06
  • @Jochen Hebbrecht: I'm pretty sure the Re-OCR-ing engine will make sure to replace the previously present OCR text. (The weaker ones will refuse to run and tell you that they can't proceed because there is text already there, or whatever...) – Kurt Pfeifle Jul 10 '12 at 19:07
  • @FrankRem The similar question you linked has vanished. Is it possible to insert some of the info that was there? – Fildor Jul 15 '16 at 07:29
  • Last link is still broken :( – jtlz2 Sep 17 '19 at 10:55
  • @jtlz2 the question was deleted since. A search may give you similar questions. – Frank Rem Sep 18 '19 at 11:05

I upvoted Frank Rem's answer, because it is 'complete'.

Let me add a few details however:

  1. The 'invisibility' of text comes from `Tr`, the PDF operator that sets the text rendering mode. Mode 3 means "Neither fill nor stroke text" (PDF-1.7 spec, Chapter 9.3.6).
  2. Have a look at this SuperUser question: "PDF has an extra blank in all words after running through Ghostscript" and my answers over there to learn a few more things about the technical details (esp. look at the one with the headline "How can we make the invisible text visible?").
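To illustrate the first point, here is a small Python sketch that lists the rendering modes set in an already decoded content stream and guesses whether the text is an invisible OCR overlay. The function names are invented for this example, and the regex is a simplification: it does not tokenize the stream properly, so a string literal that happens to contain something like `3 Tr` would fool it.

```python
import re

# A single digit 0-7 that stands alone as an operand, followed by the Tr operator.
TR_RE = re.compile(rb"(?<![\w.+-])([0-7])\s+Tr\b")


def text_rendering_modes(content: bytes) -> list[int]:
    """Return every rendering mode set via Tr, in order of appearance."""
    return [int(m.group(1)) for m in TR_RE.finditer(content)]


def looks_like_ocr_layer(content: bytes) -> bool:
    """Guess: at least one Tr is present and all of them select mode 3."""
    modes = text_rendering_modes(content)
    return bool(modes) and all(mode == 3 for mode in modes)
```

Note that the default rendering mode is 0 (fill), so a stream without any `Tr` operator contains ordinary visible text, which is why the guess requires at least one explicit `3 Tr`.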
Kurt Pfeifle