I am trying to extract text from a PDF file using the code found here. The code uses the zlib library.
AFAICT, the program works by finding the blocks of data between occurrences of the keywords "stream" and "endstream" in the PDF file. These chunks are then inflated by zlib.
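To make the scanning step concrete, here is a minimal sketch of how I understand it. The function name `FindStreams` and the use of `std::string` as the file buffer are my own assumptions for illustration, not taken from the linked code. It skips the end-of-line marker after the "stream" keyword and leaves any trailing end-of-line before "endstream" in the span, which zlib should ignore once the deflate data has ended. A more robust parser would presumably rely on the /Length entry of the stream dictionary rather than searching for keywords.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [begin, end) byte offsets of the data
// between each "stream" / "endstream" keyword pair in the raw file bytes.
std::vector<std::pair<std::size_t, std::size_t>> FindStreams(const std::string& pdf)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    const std::string kBegin = "stream";
    const std::string kEnd = "endstream";

    std::size_t pos = 0;
    while ((pos = pdf.find(kBegin, pos)) != std::string::npos)
    {
        std::size_t begin = pos + kBegin.size();

        // The keyword is followed by CRLF or a lone LF; skip it so the
        // span starts at the first byte of the stream data.
        if (begin < pdf.size() && pdf[begin] == '\r') ++begin;
        if (begin < pdf.size() && pdf[begin] == '\n') ++begin;

        const std::size_t end = pdf.find(kEnd, begin);
        if (end == std::string::npos) break; // unterminated stream: stop

        spans.emplace_back(begin, end);

        // Continue after "endstream" so its embedded "stream" is not
        // mistaken for the start of another block.
        pos = end + kEnd.size();
    }
    return spans;
}
```

Each returned pair would then be handed to zlib as `next_in = data + begin` and `avail_in = end - begin`.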
The code works perfectly on one sample PDF document, but on another it appears that zlib's inflate() function returns -3 (Z_DATA_ERROR) every time it is called.
I noticed that the PDF file that fails is set up so that, when it is opened in Adobe Reader, there is no "copy" option. Could this be related to the inflate() error? And if it is, is there a way around the problem?
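One quick way to test this hypothesis might be to check whether the file declares encryption at all. As far as I understand the PDF format, a protected file (including one that only restricts copying) references an /Encrypt dictionary from its trailer, and its stream bytes are encrypted after being compressed, so feeding them straight to inflate() would be expected to fail. The sketch below is only a heuristic that searches the raw bytes for that name; `LooksEncrypted` and `ReadFileBytes` are hypothetical helper names of my own.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Heuristic check: true if the raw file bytes contain the name "/Encrypt".
// This can give a false positive if the same text happens to appear in
// uncompressed content, so treat the result as a hint only.
bool LooksEncrypted(const std::string& pdfBytes)
{
    return pdfBytes.find("/Encrypt") != std::string::npos;
}

// Hypothetical helper: reads a whole file into memory in binary mode.
// Returns an empty string if the file cannot be opened.
std::string ReadFileBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling `LooksEncrypted(ReadFileBytes(path))` on both the working and the failing document should show whether encryption is what distinguishes them.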
Code snippet below (see comments):

    // Now use zlib to inflate:
    z_stream zstrm;
    ZeroMemory(&zstrm, sizeof(zstrm));
    zstrm.avail_in = streamend - streamstart + 1;
    zstrm.avail_out = outsize;
    zstrm.next_in = (Bytef*)(buffer + streamstart);
    zstrm.next_out = (Bytef*)output;
    int rsti = inflateInit(&zstrm);
    if (rsti == Z_OK)
    {
        int rst2 = inflate(&zstrm, Z_FINISH); // HERE IT RETURNS -3
        if (rst2 >= 0)
        {
            // Ok, got something, extract the text:
            size_t totout = zstrm.total_out;
            ProcessOutput(fileo, output, totout);
        }
        inflateEnd(&zstrm); // release zlib's internal state
    }
EDIT: I tested text extraction from the "encrypted" PDF via an online PDF-to-text converter called Zamzar, and the resulting text file was perfect. So either Zamzar has some super-duper decrypting system, or perhaps it's just not very difficult.
EDIT: I just found that A-PDF also converted it to text without problems.