11

I saw the question "PHP - Get number of pages in a Word document". I also need to determine the page count of a given Word file (doc/docx). I looked into phplivedocx/ZF (@hobodave linked to them in the answers to the original post), but I got completely lost there. I can't use an external web service either (such as a DOC2PDF site, and then counting the pages in the PDF version).

Simply: is there any PHP code (using ZF or anything else in PHP, but excluding COM objects or external executables such as AbiWord; I'm on a shared Linux server without exec or similar functions) to find the page count of a Word file?

EDIT: The Word versions that need to be supported are Microsoft Word 2003 & 2007.

Yaakov Shoham
    To which file-format standard(s) of a msword file are you referring to? Please add the specification if you want to get specific answers. – hakre Jan 24 '12 at 11:31

4 Answers

22

Getting the number of pages for docx files is very easy:

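function get_num_pages_docx($filename)
{
    $zip = new ZipArchive();

    // open() returns true on success or an integer error code on failure
    if ($zip->open($filename) !== true) {
        return false;
    }

    // docProps/app.xml holds the extended properties, including the page count
    $data = $zip->getFromName('docProps/app.xml');
    $zip->close();

    if ($data === false) {
        return false;
    }

    $xml = new SimpleXMLElement($data);

    // Cast so the caller gets an int rather than a SimpleXMLElement
    return isset($xml->Pages) ? (int) $xml->Pages : false;
}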
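For reference, the parsing step can be tried on its own without a real .docx file. The sketch below assumes app.xml uses the usual extended-properties namespace and a `Pages` element; the helper name `pages_from_app_xml` and the trimmed sample document are made up for illustration.

```php
<?php
// Hypothetical helper: extract the page count from the contents of docProps/app.xml.
// Returns null when the XML cannot be parsed or has no <Pages> element.
function pages_from_app_xml(string $appXml): ?int
{
    // Collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($appXml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false || !isset($xml->Pages)) {
        return null;
    }

    return (int) $xml->Pages;
}

// Trimmed sample of what app.xml typically looks like (an assumption, not taken from a real file)
$sample = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    . '<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">'
    . '<Pages>5</Pages><Words>1200</Words>'
    . '</Properties>';

var_dump(pages_from_app_xml($sample));
```

Note that the value is whatever Word last saved into the file, so it can be stale or missing if the document was written by another tool.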

For the 97-2003 format it's certainly challenging, but by no means impossible. The number of pages is stored in the SummaryInformation section of the document, but the OLE format of the files makes it a pain to find. The structure is defined extremely thoroughly (though badly, IMO) here and more simply here. I looked at this for an hour today but didn't get very far (it's not a level of abstraction I'm used to), so I output the hex to better understand the structure:

function get_num_pages_doc($filename) 
{
    // Read the whole file as raw bytes (no handle left open)
    $line = file_get_contents($filename);

    echo '<div style="font-family: courier new;">';

        $hex = bin2hex($line);
        $hex_array = str_split($hex, 4);
        $i = 0;
        $line = 0;
        $collection = '';
        foreach($hex_array as $key => $string)
        {
            $collection .= hex_ascii($string);
            $i++;

            if($i == 1)
            {
                echo '<b>'.sprintf('%05X', $line).'0:</b> ';
            }

            echo strtoupper($string).' ';

            if($i == 8)
            {
                echo ' '.$collection.' <br />'."\n";
                $collection = '';
                $i = 0;

                $line += 1;
            }
        }

    echo '</div>';

    exit();
}

function hex_ascii($string, $html_safe = true)
{
    $return = '';

    $conv = array($string);
    if(strlen($string) > 2)
    {
        $conv = str_split($string, 2);
    }

    foreach($conv as $string)
    {
        $num = hexdec($string);

        $ascii = '.';
        if($num > 32)
        {   
            $ascii = unichr($num);
        }

        if($html_safe AND ($num == 62 OR $num == 60))
        {
            $return .= htmlentities($ascii);
        }
        else
        {
            $return .= $ascii;
        }
    }

    return $return;
}

function unichr($intval)
{
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}

This outputs a hex dump in which you can find sections such as:

007000: 0500 5300 7500 6D00 6D00 6100 7200 7900 ..S.u.m.m.a.r.y.
007010: 4900 6E00 6600 6F00 7200 6D00 6100 7400 I.n.f.o.r.m.a.t.
007020: 6900 6F00 6E00 0000 0000 0000 0000 0000 i.o.n...........
007030: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 

The bytes that follow the name hold the referencing info:

007040: 2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF (...ÿÿÿÿÿÿÿÿÿÿÿÿ
007050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
007070: 0000 0000 2500 0000 0010 0000 0000 0000 ....%...........

From those you can determine the properties described in the specification:

_ab = ("SummaryInformation") 
_cb = 0028
_mse = 02 (STGTY_STREAM) 
_bflags = 01 (DE_BLACK) 
_sidLeftSib = FFFF FFFF 
_sidRightSib = FFFF FFFF (none) 
_sidChild = FFFF FFFF (n/a for STGTY_STREAM) 
_clsid = 0000 0000 0000 0000 0000 0000 0000 0000 (n/a) 
_dwUserFlags = 0000 0000 (n/a) 
_time[0] = CreateTime = 0000 0000 0000 0000 (n/a) 
_time[1] = ModifyTime = 0000 0000 0000 0000 (n/a)
_startSect = 0000 0025
_ulSize = 0000 1000 
_dptPropType = 0000 (n/a)

This lets you find the relevant section of the file, unpack it and get the page count. Of course, this is the hard bit that I just don't have time for, but it should point you in the right direction.
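As a rough illustration of the unpacking step, the 128-byte directory entry from the dump above can be decoded with PHP's `unpack()`. This is only a sketch: the helper name `parse_directory_entry` is made up, the field offsets are assumed from the compound file layout (64-byte name, then length, type, colour and sibling/child ids at offset 64, start sector and size at offset 116), and only the low 32 bits of the size are read.

```php
<?php
// Hypothetical helper: decode a few fields from one 128-byte directory entry.
function parse_directory_entry(string $entry): array
{
    if (strlen($entry) !== 128) {
        throw new InvalidArgumentException('A directory entry must be exactly 128 bytes.');
    }

    // Offset 64: name length in bytes (incl. terminator), entry type, colour, sibling/child ids.
    // 'v' = 16-bit little-endian, 'C' = unsigned byte, 'V' = 32-bit little-endian.
    $head = unpack('vcb/Cmse/Cbflags/VsidLeftSib/VsidRightSib/VsidChild', substr($entry, 64, 16));

    // Offset 116: starting sector and stream size (low 32 bits only)
    $tail = unpack('VstartSect/VulSize', substr($entry, 116, 8));

    // The name is UTF-16LE; drop the 2-byte terminator counted in cb
    $nameBytes = substr($entry, 0, max(0, $head['cb'] - 2));
    $name = mb_convert_encoding($nameBytes, 'UTF-8', 'UTF-16LE');

    return ['name' => $name] + $head + $tail;
}

// The eight 16-byte rows of the dump at 007000-00707F
$rows = [
    '0500 5300 7500 6D00 6D00 6100 7200 7900',
    '4900 6E00 6600 6F00 7200 6D00 6100 7400',
    '6900 6F00 6E00 0000 0000 0000 0000 0000',
    '0000 0000 0000 0000 0000 0000 0000 0000',
    '2800 0201 FFFF FFFF FFFF FFFF FFFF FFFF',
    '0000 0000 0000 0000 0000 0000 0000 0000',
    '0000 0000 0000 0000 0000 0000 0000 0000',
    '0000 0000 2500 0000 0010 0000 0000 0000',
];
$entry = hex2bin(str_replace(' ', '', implode('', $rows)));

$fields = parse_directory_entry($entry);
print_r($fields);
```

The start sector and size tell you where the SummaryInformation stream lives; following the sector chain and parsing the property set inside it is still left to do.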

M$ don't make it easy!

Paul Norman
  • Wonderful! That's really excellent. I hope I'll succeed in filling the gap. – Yaakov Shoham Feb 04 '12 at 18:07
  • Specification document links are dead: Other locations - [Compound File Binary format](http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf) and [Microsoft Office Word 97-2007 Binary File Format (.doc) Specification](http://www.digitalpreservation.gov/formats/digformatspecs/Word97-2007BinaryFileFormat(doc)Specification.pdf). – Orbling Sep 25 '13 at 17:26
  • Page Count is on page 120 of that document - the specification, it's stored in the Document Properties (tag: `DOP`) - at offset 46 (x2E), an `int` (2 byte), property name is `cPg` - reflects the last calculated count. So procedure is find DOP in the file table, then grab an integer from byte 42 of that table. – Orbling Sep 25 '13 at 17:37
  • With regard to the Compound File Binary format - there is a better explanatory file on Open Office: http://www.openoffice.org/sc/compdocfileformat.pdf - this has a very useful example section (Section 8), which helps make sense of the mess, which is essentially a crappy file system in a file. – Orbling Sep 25 '13 at 18:24
  • A lot of the work of reading the Compound File Binary format is done here: https://code.google.com/p/binary-compound-file-reader/ Essentially all you need to do is open a filestream on the file, create an `OLEFile` with it, from that you can pull the allocation tables, if you can get the directory sector from the header, which is stored, pull that, entries are 128 bytes. Format as above in the answer - start sector for "WordDocument" should put you in the FIB, offset 402 is a long containing the location of DOP in the "Table" stream (another file). – Orbling Sep 25 '13 at 19:48
  • My docx file name is UTF-8, but ZipArchive has a problem opening this DOCX. https://stackoverflow.com/questions/45154025/php-ziparchive-dont-support-utf8-files-for-open?noredirect=1#comment77280333_45154025 – user3770797 Jul 18 '17 at 07:29
  • Just to note `$zip->close();` is called twice when it only needs to be called once – u01jmg3 Apr 16 '18 at 21:07
3

Have a look at PHPWord on Microsoft CodePlex: http://phpword.codeplex.com/

It will allow you to open and read the word formatted file in PHP and do whatever processing you require.

iWantSimpleLife
2

To get metadata properties of doc, docx, ppt and pptx files, such as the number of pages or slides, using PHP, I followed the process below and it worked like a charm. I hope it helps someone.

Download and configure Apache Tika.

Once that's done, you can run the following commands; each prints all the metadata about the given file:

java -jar tika-app-1.5.jar -m test.docx
java -jar tika-app-1.5.jar -m test.doc
java -jar tika-app-1.5.jar -m test.pptx
java -jar tika-app-1.5.jar -m test.ppt

Once tested, you can execute this command from a PHP script. Thanks.
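Calling the command from PHP might look roughly like the sketch below. It assumes `shell_exec()` is enabled and `java` is on the path, that the `-m` output consists of `key: value` lines, and that the page count appears under a key such as `Page-Count`; the key name and both function names are assumptions for illustration, so check the real output for your Tika version.

```php
<?php
// Hypothetical helper: turn "key: value" lines into an associative array.
function parse_tika_metadata(string $output): array
{
    $metadata = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first ": " so keys that contain a bare colon stay intact
        $pos = strpos($line, ': ');
        if ($pos === false) {
            continue;
        }
        $metadata[substr($line, 0, $pos)] = trim(substr($line, $pos + 2));
    }
    return $metadata;
}

// Hypothetical wrapper: run Tika in metadata mode and parse what it prints.
function tika_metadata(string $jarPath, string $file): array
{
    $command = 'java -jar ' . escapeshellarg($jarPath) . ' -m ' . escapeshellarg($file);
    $output = shell_exec($command);
    return is_string($output) ? parse_tika_metadata($output) : [];
}

// Made-up sample output, used here so the parser can be tried without Java
$sample = "Content-Type: application/msword\nPage-Count: 5\n";
$meta = parse_tika_metadata($sample);
echo $meta['Page-Count'] ?? 'unknown', "\n";
```

With Tika installed, `tika_metadata('tika-app-1.5.jar', 'test.docx')` would return the same kind of array for a real file.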

opensource-developer
-1

Excluding using Abiword or OpenOffice? Impossible - number of pages will depend on number of words/letters, fonts used, justification and kerning, margin size, line spacing, paragraph spacing, number of paragraphs, columns, size of graphics / embedded objects, page / column breaks and page margins.

You need something that can understand all of these.

Even if you use OpenOffice or Abiword, reflowing the text may change the number of pages. Indeed, in some cases opening the same document on a different instance of MSWord may result in a difference.

The best you could probably manage would be a statistical approach based on a representation of the document - but you'll still see huge variance.

symcbean
    I've opened both a 2003 file (.doc) and a 2007 file (.docx) with 7zip. In the extracted 2007 files I found an XML file (docProps/app.xml) that explicitly includes the number of pages (`5`). In the 2003 file I didn't find any XML, only some other files, but in Windows Explorer you can look at the Properties of the file, in the Summary tab, in the Advanced part, and see the number of pages. I can't test it now, but I believe this data isn't calculated on the fly but is stored explicitly, in some way, inside the combined Word file. Actually, exactly this number is what I need. – Yaakov Shoham Feb 01 '12 at 08:50
  • My docx file name is UTF-8, but ZipArchive has a problem opening this DOCX. https://stackoverflow.com/questions/45154025/php-ziparchive-dont-support-utf8-files-for-open?noredirect=1#comment77280333_45154025 – user3770797 Jul 18 '17 at 07:33