
I'm trying to parse PDF files into plain text (strings) with pure PHP, because I have no access to exec, system, or the other functions that are disabled on the server I'm working on.

Those PDF files can't be parsed by the functions I found online.

This is what I get from echo file_get_contents("file.pdf");

%PDF-1.4 5 0 obj << /Type /XObject /Subtype /Image /Filter /DCTDecode /Length 6536 /Width 200 /Height 125 /BitsPerComponent 8 /ColorSpace /DeviceRGB >> stream ÿØÿàJFIFÿÛC  %# , #&')*)-0-(0%()(ÿÛC   ((((

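and then the rest of the content. So this is a PDF 1.4 file (the "5 0 obj" after the header is the start of the first object, not part of the version).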
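To narrow down why the existing functions fail, it may help to check which stream filters the file actually uses. The following is only a rough sketch: pdfSummary is a hypothetical helper name (not from any library), and it only sees filter names that appear uncompressed in the raw file, so anything stored inside compressed object streams would be missed.

```php
<?php
// Hypothetical helper (not from any library): reports the PDF version and
// how often each stream filter name appears in the raw file data.
function pdfSummary(string $pdfdata): array {
    $version = null;
    // The header looks like "%PDF-1.4"; whatever follows is the first object.
    if (preg_match('/^%PDF-(\d+\.\d+)/', $pdfdata, $m)) {
        $version = $m[1];
    }
    $filters = array();
    // Match "/Filter /Name" as well as "/Filter [/A /B]" arrays.
    if (preg_match_all('/\/Filter\s*(\[[^\]]*\]|\/[A-Za-z0-9]+)/', $pdfdata, $m)) {
        foreach ($m[1] as $spec) {
            preg_match_all('/\/([A-Za-z0-9]+)/', $spec, $names);
            foreach ($names[1] as $name) {
                $filters[$name] = ($filters[$name] ?? 0) + 1;
            }
        }
    }
    return array('version' => $version, 'filters' => $filters);
}

// Sample input taken from the start of the dump above.
$sample = "%PDF-1.4 5 0 obj << /Type /XObject /Subtype /Image /Filter /DCTDecode /Length 6536 >> stream";
print_r(pdfSummary($sample));
```

If the summary lists mostly DCTDecode (JPEG image data) and few or no FlateDecode streams, the pages may be scanned images, in which case there would be no text streams for a regex-based extractor to find.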

Here is the function I was using for PDF 1.2-1.3 (it doesn't work with these files):

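function decomprimiPDF($pdfdata) {
    // A short argument that names an existing file is treated as a path.
    if (strlen($pdfdata) < 1000 && file_exists($pdfdata)) {
        $pdfdata = file_get_contents($pdfdata);
    }
    $result = '';
    // Find every FlateDecode-compressed stream in the file.
    if (preg_match_all('/<<[^>]*FlateDecode[^>]*>>\s*stream(.+)endstream/Uis', $pdfdata, $m)) {
        foreach ($m[1] as $chunk) {
            $chunk = @gzuncompress(ltrim($chunk));
            if ($chunk === false) {
                continue; // not valid zlib data, skip this stream
            }
            // Text usually sits inside [...] arrays; otherwise scan the whole chunk.
            $a = preg_match_all('/\[([^\]]+)\]/', $chunk, $m2) ? $m2[1] : array($chunk);
            foreach ($a as $subchunk) {
                // Collect the (...) string literals.
                if (preg_match_all('/\(([^\)]+)\)/', $subchunk, $m3)) {
                    $result .= (join('', $m3[1]) . '*');
                }
            }
        }
    }
    return $result;
}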
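To check that the stream-matching and decompression steps themselves behave as expected, independent of any particular file, one option is a round trip on a tiny synthetic object. This is a minimal sketch, assuming the zlib extension is available (the function above already depends on it); inflateStreams is a hypothetical helper name, and the pattern is deliberately simplified, so it would not cope with nested dictionaries.

```php
<?php
// Hypothetical helper (illustration only): returns the inflated body of every
// FlateDecode stream, silently skipping streams that do not decompress.
function inflateStreams(string $pdfdata): array {
    $bodies = array();
    // Capture the bytes between the "stream" line and the "endstream" line.
    $pattern = '/<<[^>]*FlateDecode[^>]*>>\s*stream\r?\n(.*?)\nendstream/s';
    if (preg_match_all($pattern, $pdfdata, $m)) {
        foreach ($m[1] as $raw) {
            $plain = @gzuncompress($raw); // false when $raw is not zlib data
            if ($plain !== false) {
                $bodies[] = $plain;
            }
        }
    }
    return $bodies;
}

// Build a tiny synthetic object to exercise the helper.
$content = "BT (Hello) Tj ET";
$pdf = "1 0 obj << /Filter /FlateDecode >>\nstream\n" . gzcompress($content) . "\nendstream endobj";
var_dump(inflateStreams($pdf) === array($content));
```

If this round trip succeeds but a real file still yields nothing, the problem is more likely in the file's structure (other filters, image-only pages, encryption) than in the decompression call.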

Can anyone here help me with a PHP function? (To repeat: I have already tried almost every function available online, and also a few classes, but none of them work with the PDF files I'm talking about.)

Thanks for your support ;)
