Extracting text from PDF

Question

I am attempting to extract text from a PDF file using the code found here. The code employs the zlib library.

AFAICT the program works by finding blocks of memory between occurrences of the text "stream" and "endstream" in the pdf file. These chunks are then inflated by zlib.

The code works perfectly on one sample PDF document, but on another, zlib's inflate() function returns -3 (Z_DATA_ERROR) every time it is called.

I noticed that the PDF file that fails is set up so that, when opened in Adobe Reader, there is no "copy" option. Could this be related to the inflate() error? And if it is, is there a way around the problem?

Code snippet below - see comments

            // Now use zlib to inflate the bytes found between "stream" and "endstream":
            z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));

            zstrm.avail_in = streamend - streamstart + 1;
            zstrm.avail_out = outsize;
            zstrm.next_in = (Bytef*)(buffer + streamstart);
            zstrm.next_out = (Bytef*)output;

            int rst1 = inflateInit(&zstrm);
            if (rst1 == Z_OK)
            {
                int rst2 = inflate(&zstrm, Z_FINISH); // HERE IT RETURNS -3
                if (rst2 >= 0)
                {
                    // OK, got something, extract the text:
                    size_t totout = zstrm.total_out;
                    ProcessOutput(fileo, output, totout);
                }
                inflateEnd(&zstrm); // release zlib's internal state
            }

EDIT: I tested text extraction from the "encrypted" PDF via an online PDF-to-text converter called zamzar, and the resulting text file was perfect. So either zamzar has some super-duper decrypting system... or perhaps it's just not very difficult.

EDIT: Just found that A-pdf also converted to text without problems.

A sample document that causes the error would help. Using a debugger to figure out where the error is would help. — Robert Jacobs, Jun 03 '15 at 14:32
The code from codeproject you reference is full of assumptions which sometimes are true and sometimes not. The fact that *there is no "copy" option* probably indicates that the PDF is encrypted to apply restrictions. It does not look like the codeproject code attempts decryption. So zlib tries to inflate encrypted data which obviously cannot work. A proper way around would be to use a proper PDF library. — mkl, Jun 03 '15 at 14:47
Some of these libraries appear to be very complex to install and get running. I am reluctant to go through all that work without having some indication of the probability of them working. — Mick, Jun 04 '15 at 07:17

score 5 · Answer 1 · edited May 23 '17 at 12:31

Streams in PDF need not be encoded with flate. They could be encoded with:

Nothing
LZW
Flate
ASCII85
Crypt (which could be one of several different algorithms)

And (surprise, surprise) any of these methods could also be layered on top of each other!
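
To see which encodings a given stream actually uses, you can read the /Filter entry of the dictionary that precedes the "stream" keyword. What follows is only a rough sketch: the function name ListStreamFilters is made up for illustration, it assumes you have already isolated the dictionary text, and it ignores rarer cases such as /Filter given as an indirect reference. The names come back in the order a decoder would have to apply them.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the filter names listed under /Filter.
// `dict` is assumed to be the raw dictionary text that precedes "stream".
std::vector<std::string> ListStreamFilters(const std::string& dict)
{
    std::vector<std::string> filters;
    const std::string key = "/Filter";
    std::size_t pos = dict.find(key);
    if (pos == std::string::npos)
        return filters;                    // no /Filter entry: no filter is listed
    pos += key.size();

    auto isSpace = [&](std::size_t i) {
        return std::isspace(static_cast<unsigned char>(dict[i])) != 0;
    };
    // A PDF name ends at whitespace or at one of these delimiter characters.
    auto endsName = [&](std::size_t i) {
        return isSpace(i) || std::string("/[]<>()").find(dict[i]) != std::string::npos;
    };

    while (pos < dict.size() && isSpace(pos)) ++pos;
    const bool isArray = pos < dict.size() && dict[pos] == '[';
    if (isArray) ++pos;                    // several filters, layered

    while (pos < dict.size()) {
        while (pos < dict.size() && isSpace(pos)) ++pos;
        if (pos >= dict.size() || dict[pos] != '/') break;   // ']' or unexpected token
        const std::size_t start = ++pos;   // skip the leading '/'
        while (pos < dict.size() && !endsName(pos)) ++pos;
        filters.push_back(dict.substr(start, pos - start));
        if (!isArray) break;               // a single name, not an array
    }
    return filters;
}
```

For a dictionary containing `/Filter [/ASCII85Decode /FlateDecode]`, this returns both names, and inflate() should only be called after the ASCII85 layer has been undone. Note that document-level encryption is normally not listed in /Filter at all, so a stream can claim /FlateDecode and still be encrypted underneath.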

If there is no copy option, chances are it is encrypted with an owner password and no user password. This allows the author to create access permissions that are supposed to be honored by a reader including:

Modifying the document contents
Copying text/graphics
Adding/editing annotations
Printing
Form filling
Assembling the document (insert, delete pages, creating bookmarks, thumbnails)
High/low quality print
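
These permissions are stored as bit flags in the /P integer of the encryption dictionary. As a sketch of how a reader might decode them: the struct and function names below are invented for illustration, and the bit positions are the ones I recall from the permissions table of the standard security handler in the PDF spec (bits are numbered from 1, lowest-order first), so check them against the spec before relying on them.

```cpp
#include <cstdint>

// Hypothetical container for the decoded flags.
struct PdfPermissions {
    bool print;             // bit 3
    bool modify;            // bit 4
    bool copy;              // bit 5: copy/extract text and graphics
    bool annotate;          // bit 6
    bool fillForms;         // bit 9
    bool assemble;          // bit 11
    bool printHighQuality;  // bit 12
};

// Decodes the signed 32-bit /P value from the encryption dictionary.
PdfPermissions DecodePermissions(std::int32_t p)
{
    const std::uint32_t bits = static_cast<std::uint32_t>(p);
    // Bit numbers are 1-based, so bit n corresponds to mask 1 << (n - 1).
    auto bit = [bits](int n) { return ((bits >> (n - 1)) & 1u) != 0; };

    PdfPermissions perms = {
        bit(3), bit(4), bit(5), bit(6), bit(9), bit(11), bit(12)
    };
    return perms;
}
```

With this layout, a /P value of -3900 decodes to "printing allowed, copying not allowed", which is the kind of setting that would make the "copy" option disappear.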

This particular approach to getting text out of a PDF is fraught with error. I can supply you with a set of documents that your approach won't be able to handle because of font re-encoding, split-up text, oddball locations, form XObjects, unusual transformations, and so on.

To do this properly, you need a better set of tools that aren't blind to the actual format and structure of a PDF document. iText will do this, DotImage will do this.

To give you an idea of the scope of the problem, I wrote the original text search code in Acrobat 1.0 and with all the internal tools available to me, it took me many months to get it right and the code included the ability to find text in unusual, non-rectilinear orientations (think maps), handling ligatures, re-encoding, non-roman fonts, and so on. While I was working on that code, there was another engineer who was dedicated full time for several years writing code called Wordy to do something similar (but more complicated) for full-text extraction and indexing (see this answer for more information about Wordy).

You will be pleased to hear that my standard advice on failed text copying is still "If Acrobat can't do it, no-one can"! – Jongware, Jun 03 '15 at 20:49
See edits to original post. Two different packages succeeded in extracting the text without me knowing any passwords. — Mick, Jun 04 '15 at 07:12
*Two different packages succeeded in extracting the text without me knowing any passwords* - this essentially means that those packages disrespect both the PDF specification and the usage restrictions set explicitly by the author. — mkl, Jun 04 '15 at 08:14
Sounds to me like the data is not really "encrypted" at all. It sounds like there is simply a flag somewhere which says "do not allow"... but then this would not explain why zlib cannot inflate(). Hmmmm. – Mick, Jun 04 '15 at 09:01
PDF files can have two passwords, a user password and an owner password. If there is no (empty) user password and there is an owner password, data is still encrypted in the file, but any reader can access it without the owner password because of the empty user password. Readers are supposed to honor the owner password restrictions. See section 7.6 and especially 7.6.3 and 7.6.3.2 in the PDF ISO spec for all the details. — plinth, Jun 04 '15 at 12:59

score 1 · Answer 2 · answered Jun 03 '15 at 15:02

If there's no "copy" option, then the PDF is encrypted, and so are its streams. Plain zlib won't work on them: you'll have to decrypt the PDF first. While you are at it, use a proper library to extract the text; there are many encodings to take care of, and not everything is WinAnsi.
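
Before handing stream data to zlib, you can at least detect this situation. The sketch below is a crude heuristic, not a parser: the name LooksEncrypted is made up, and it simply assumes that an encrypted file has an /Encrypt key in its trailer (or cross-reference stream dictionary) and that the key is visible in the raw bytes. Binary data that happens to contain the same bytes would give a false positive.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the raw file contents contain an
// /Encrypt key. `pdf` is assumed to hold the whole file, read in binary mode.
bool LooksEncrypted(const std::string& pdf)
{
    const std::string key = "/Encrypt";
    for (std::size_t pos = pdf.find(key); pos != std::string::npos;
         pos = pdf.find(key, pos + 1)) {
        const std::size_t next = pos + key.size();
        // Reject longer names such as /EncryptMetadata: the key must be
        // followed by a non-alphanumeric character or the end of the data.
        if (next >= pdf.size() ||
            !std::isalnum(static_cast<unsigned char>(pdf[next])))
            return true;
    }
    return false;
}
```

If this returns true, inflate() failing with Z_DATA_ERROR is the expected outcome, and the streams need to be decrypted (or the file handed to a PDF library) before any decompression is attempted.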

Paulo Soares


when you say the pdf is "encrypted" - does that mean that I need to know some password to decrypt? (I do not have a password)... Will a "proper library" even help? – Mick Jun 03 '15 at 15:07
If no password is needed to open it, then it's encrypted without a user password. I only know about iText (disclaimer: I'm an iText contributor), but there's also PDFBox as a free option in Java. – Paulo Soares Jun 03 '15 at 15:51
"Encrypted without user password"? This makes no sense to me. It sounds like the file is scrambled up, but in a way that anyone can unscramble... is that right? – Mick Jun 03 '15 at 15:59
Just tried pdfbox and it reported "You do not have permission to extract text" – Mick Jun 03 '15 at 16:14
So, as everybody guessed, the PDF is encrypted to restrict usage permissions. As you don't have a password and therefore cannot supply the owner password to PDFBox, it correctly tells you that you are not allowed to extract the text. – mkl Jun 03 '15 at 20:50
@Mick *It sounds like the file is scrambled up, but in a way that anyone can unscramble... is that right?* - essentially yes, it mostly serves as a hint to PDF processing software that permission restrictions are set. This mechanism obviously requires PDF processors to cooperate; it does not really enforce the restrictions. – mkl Jun 04 '15 at 08:25

score -4 · Answer 3 · edited May 23 '17 at 11:44

This can happen because the stream headers differ from one document to another; see the related question "ZLib Inflate() failing with -3 Z_DATA_ERROR".


answered Jun 03 '15 at 14:35

Mihai8


I tried using inflateInit2() as per the marked answer, but that didn't fix it :-( – Mick Jun 03 '15 at 14:52
