According to http://www.searchable-pdf.com/content.php?lang=en&c=61, a PDF becomes searchable when a text layer is added to it.
I was looking for the technical specification of the PDF format. As far as I can tell, text can be stored in a PDF in two ways: a) as a text layer on top of the image layer (as described on the page above), or b) when a PDF is created from a Word document that contains text. In case b) I don't think Word stores the text in a text layer; I assume it ends up in the image layer instead. Is that right?
XMP was added in PDF 1.4 (http://en.wikipedia.org/wiki/Extensible_Metadata_Platform). But what is XMP? Is it the "text layer" I described above?
If a scanner performs OCR on an image, does it store the recognized text in the "text layer" or in the "XMP" field? And is this only possible in PDF 1.4 or later?
And how can I detect whether a PDF already contains text data? For example: PDF A was scanned with OCR and PDF B was not. How can I tell that PDF B still needs to be sent to a separate OCR engine?
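To make the question concrete, here is a minimal Python sketch of the kind of check I have in mind. It rests on assumptions I have not verified against the specification: that text (including OCR text) is drawn with `Tj`/`TJ` operators inside `BT ... ET` blocks of a content stream, and that streams are either uncompressed or Flate-compressed. The names `has_text_operators` and `needs_ocr` are my own, and the regular expressions are a rough heuristic, not a real PDF parser.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

# Assumption: a text object looks like "BT ... (string) Tj ... ET",
# where the string may also be a hex string <...> or an array [...] TJ.
TEXT_RE = re.compile(rb"\bBT\b.*?[)>\]]\s*T[jJ]\b.*?\bET\b", re.DOTALL)


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Return True if any stream appears to contain text-showing operators."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Assume Flate compression; decompressobj tolerates trailing bytes.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data: treat the stream as uncompressed.
            data = raw
        if TEXT_RE.search(data):
            return True
    return False


def needs_ocr(path: str) -> bool:
    """Hypothetical helper: True if the file at `path` shows no sign of text."""
    with open(path, "rb") as handle:
        return not has_text_operators(handle.read())
```

With this, `needs_ocr("b.pdf")` would flag PDF B for the OCR engine. I expect it to miss encrypted files and streams that use other filters, and it could give false positives on raw binary data, so I would welcome a more reliable approach.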