0

How can I read text from a PDF file created by Adobe Distiller?

I'm currently using ABCpdf, and I have a code sample that reads PDF contents, but it can only read text from PDFs that were created by the Adobe PDF Library:

    public string ExtractTextsFromAllPages(string pdfFileName)
    {
        var sb = new StringBuilder();

        using (var doc = new Doc())
        {
            doc.Read(pdfFileName);

            for (var currentPageNumber = 1; currentPageNumber <= doc.PageCount; currentPageNumber++)
            {
                doc.PageNumber = currentPageNumber;
                sb.Append(doc.GetText("Text"));
            }
        }

        return sb.ToString();
    }

I have other PDF files that were created by Adobe Distiller, and the above code doesn't work on them: it returns the strange, seemingly encoded data below:

\0\a\b\0\t\n\0\r\n\0\a\b\t\n\n\b\v\f\0\t\r\f\b\0\r\0\r\n\v\b\v\f\f\n\r\0\r\0\0\0\b\r\n\0\a\r\0\0\b\r\b\b\t\n\r\0\b\r\n\t\b\v\n\b\v\v\0\a\b\r\n\r\n\v\r\0\b\b\b\v\r\0\r\n\v\f\r\f\f\r\n !\"\"\v#\t $ %&$% $'\v\"% \0( )% ! !\"\"'*$'\r\n\t $ %&$% $'\v\"% \0( \r\n\f\f\f\f\b\f\f\f\f\a \b\b\f\f\f!\"\r\n\f\a#$\f\f\f\b\f\f\a%\a \b\b\f\a\a&\a\a' \b\a\b\r\n(\f)\f)

How can I read text from a PDF file created by Adobe Distiller?

Note that I can open these PDF files in my browser as easily as any other PDF.

Thanks,

The Light
  • Are you able to copy and paste text from the PDF using Adobe Reader or any other PDF viewer? – Bobrovsky Jun 12 '12 at 19:27
  • Apparently not with Adobe Reader. I'm not sure whether the text reading or copy/paste feature has been manipulated, encrypted, etc. – The Light Jun 13 '12 at 08:44

4 Answers

0

I've had similar problems working with PDFs. I haven't used ABCpdf, but you may want to check out iTextSharp; I've used it before to build a tool that extracts strings from PDF files. However, you're still going to have a problem if the font is embedded. If you are able to switch to iTextSharp, here is a question on SO that covers the topic:

Reading PDF content with itextsharp dll in VB.NET or C#

Cyric297
0

The first thing to try is to copy and paste text from the PDF using Adobe Reader or any other PDF viewer.

If you cannot copy and paste text at all, then text extraction might be disabled via permissions in the file. PDF libraries usually ignore these permissions, so they do not affect text extraction.

If you can copy and paste text from the file but it looks garbled or incorrect, then the PDF does not contain some of the information required for proper text extraction. Such files will still be displayed properly.

Adobe Distiller produces files without the information required for proper text extraction if it's configured to produce the smallest files possible.

EDIT:

If you need to discriminate garbage characters from meaningful text, then you should implement an algorithm that measures the readability of the text.
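
As a rough illustration of such a check, here is a minimal sketch. It is not a real readability measure and is not part of ABCpdf: the class name `ExtractedTextCheck`, the method `LooksGarbled`, and both thresholds are assumptions made up for this example. It simply flags text that contains many control characters or very few letters and digits, which is what the garbled output in the question looks like.

```csharp
using System;
using System.Linq;

public static class ExtractedTextCheck
{
    // Hypothetical heuristic, not a library API. Thresholds are guesses
    // and would need tuning against your own documents.
    public static bool LooksGarbled(string text,
                                    double maxControlRatio = 0.1,
                                    double minAlphanumericRatio = 0.5)
    {
        // Nothing to judge, so do not flag empty input.
        if (string.IsNullOrEmpty(text)) return false;

        // Control characters other than tab, CR and LF rarely occur in real text.
        int control = text.Count(c =>
            char.IsControl(c) && c != '\t' && c != '\r' && c != '\n');

        // Readable text normally consists mostly of letters and digits.
        int alphanumeric = text.Count(c => char.IsLetterOrDigit(c));

        double controlRatio = (double)control / text.Length;
        double alphanumericRatio = (double)alphanumeric / text.Length;

        return controlRatio > maxControlRatio
            || alphanumericRatio < minAlphanumericRatio;
    }
}
```

You could call `ExtractedTextCheck.LooksGarbled(doc.GetText("Text"))` per page and skip the pages it flags. Text that is mostly punctuation or symbols would be flagged too, so treat the result as a hint rather than a verdict.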

Some links for that:

Bobrovsky
  • Thanks. How can I determine whether the text is garbled or valid? Both consist of ASCII characters. There is no single common word that I can rely on to identify a valid PDF. – The Light Jun 14 '12 at 09:43
  • It's another question :-) I suspect you don't need to know that. After all, what can you do if the text is garbled? Nothing? – Bobrovsky Jun 14 '12 at 10:45
  • I'd need to know whether I should ignore or store that text ;) – The Light Jun 15 '12 at 08:40
0

So, the fact that you do not see readable text might be caused by an unusual encoding. We normally assume that the ASCII character set is used for encoding. Imagine the sentence "Hello world" (in ASCII, as hex: 48 65 6C 6C 6F 20 77 6F 72 6C 64). In a straightforward reading, we would assume that 48 means "H", 65 means "e", and so on.

But imagine an engineer doing his own font subsetting: he assigns codes in order of first appearance, so "H" becomes 00, "e" becomes 01, and so on. The sentence would then be encoded as 00 01 02 02 03 04 05 03 06 02 07.

This results in a run of unreadable characters, just like ancient secret scripts that are encoded and decoded via a secret table.

The answer to your question is simply: you can read text generated by Distiller only when you know the right encoding vector for reassembling it.
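
To make the "Hello world" example concrete, here is a small sketch of that kind of first-appearance encoding. It is only a toy model of the idea, not how Distiller or any PDF library actually works; the class `SubsetEncodingDemo` and its methods are names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubsetEncodingDemo
{
    // Assigns each distinct character a code in order of first appearance
    // and returns the codes; "table" receives the code-to-character mapping.
    // Toy model only: it supports at most 256 distinct characters.
    public static byte[] Encode(string text, out List<char> table)
    {
        table = new List<char>();
        var codes = new byte[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            int index = table.IndexOf(text[i]);
            if (index < 0)
            {
                table.Add(text[i]);
                index = table.Count - 1;
            }
            codes[i] = (byte)index;
        }
        return codes;
    }

    // Recovers the text, which is only possible when the table is known.
    public static string Decode(byte[] codes, IReadOnlyList<char> table)
    {
        return new string(codes.Select(code => table[code]).ToArray());
    }

    // What a reader gets without the table: each code is taken as a
    // character value, which yields control characters instead of letters.
    public static string NaiveDecode(byte[] codes)
    {
        return new string(codes.Select(code => (char)code).ToArray());
    }
}
```

Encoding "Hello world" this way gives the codes 00 01 02 02 03 04 05 03 06 02 07. `Decode` with the table returns the original sentence, while `NaiveDecode` returns a string of control characters much like the output shown in the question.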

bmx
-1

ABCpdf can extract text from all PDFs that contain valid text. It infers spaces, de-hyphenates, clips to an area of interest and many other things that are required to ensure that the text you get is the same as the text you see.

However all this assumes that the PDF is valid - that it conforms to the PDF spec - that it is not corrupt.

The most common cause of text extraction problems is corrupt Identity encoded fonts. Identity encoded fonts are referenced by glyph rather than by character code. The fonts include a ToUnicode map to allow the glyph IDs to be converted to characters.

However, we sometimes see documents from which this entry has been removed. This means that the only way to identify the characters would be to OCR the document.

You can see this yourself if you open the documents in Acrobat and copy the text. When you paste the copied text into an application such as Notepad, you will be able to see that it is wrong. ABCpdf just sees the same as Acrobat.

The fact that these documents have been so thoroughly and effectively mangled may be intentional. It is certainly a good way to ensure no-one can copy your text.

I wrote the ABCpdf .NET text extraction so I should know. :-)