How to count characters with spaces in docx/odt(rtf) files?

Question

Is there any way to get, online (meaning as part of an upload form, so in PHP/JavaScript), the number of characters including spaces in a document saved as DOCX or ODT (and RTF if possible)? I mean getting the same character count as shown in Word's statistics.

I know that Word stores a <characters> value in its app.xml file, but that value is not precise and probably does not include spaces, though I am not sure about that.

I've tried to do it the simple way: open the XML files, extract the text and count its characters. The problem is that this is not accurate either. See my code:

$document = 'cvicnytext2.docx';

function extracttext($filename) {
    // Check the extension
    $ext = explode(".", $filename);
    $ext = end($ext);

    // DOCX keeps the body in word/document.xml, ODT in content.xml
    if ($ext == 'docx') {
        $dataFile = "word/document.xml";
    } else {
        $dataFile = "content.xml";
    }

    $zip = new ZipArchive;

    // Open the archive file
    if (true === $zip->open($filename)) {
        $text = false;
        if (($index = $zip->locateName($dataFile)) !== false) {
            $text = $zip->getFromIndex($index);
        }
        // Close the archive on every path, not only when the entry is missing
        $zip->close();

        if ($text !== false) {
            $xml = new DOMDocument();
            $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Drop all markup and keep only the text content
            return strip_tags($xml->saveXML());
        }
    }
    return "File not found";
}

$length = strlen( utf8_decode( extracttext($document) ) );
echo "Length: " . $length . " (chars with spaces).";

For example, if I upload this file, my code gives 76015 characters, but Word shows 76113, so about a hundred characters go missing somewhere.

Does anybody have any idea how to make it more precise? Your help will be appreciated.

Some more UPDATES

I've found that there is no big difference between the two functions used for counting the length: mb_strlen( $text ) and strlen( utf8_decode( $text ) ).

What probably causes the issue is reading the zip file: it seems to add a space before and after the string, plus some characters that are not printed but are still counted. Any idea? If I copy/paste the same text directly into the counting functions, it works without trouble...

Eduardo Ramos · Answer 1 · 2015-06-14T13:52:17.697

I believe that your approach is basically the only available one if you do not want to get into the nitty-gritty details of the ODF or OOXML standard.

To get an exact count, you will first need to remove the nodes that are "not printed" yet may still contain some text, for example the titles and descriptions of images and objects, ...
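
As a rough illustration of that idea, here is a minimal sketch that drops such elements before counting. The function name `countWithoutHiddenNodes` and the default element names (`title`, `desc`, matched by local name) are assumptions made for this example, not a complete list of non-printed nodes in either format; you would have to adjust them to what your documents actually contain.

```php
<?php
// Hypothetical helper: removes elements whose text is not rendered in the
// document body, then counts the remaining characters.
// The default names ('title', 'desc') are an assumption for illustration.
function countWithoutHiddenNodes(string $xmlString, array $hiddenNames = ['title', 'desc']): int
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        throw new InvalidArgumentException('Could not parse XML');
    }

    // Match by local name so the sketch does not depend on namespace prefixes.
    $conditions = [];
    foreach ($hiddenNames as $name) {
        $conditions[] = "local-name()='" . $name . "'";
    }
    $query = '//*[' . implode(' or ', $conditions) . ']';

    // Copy the matches into an array before removing anything from the tree.
    $xpath = new DOMXPath($dom);
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates the text of all remaining descendants.
    return mb_strlen($dom->documentElement->textContent, 'UTF-8');
}
```

You would call it on the XML string read from the archive, e.g. `countWithoutHiddenNodes($zip->getFromName('content.xml'))`.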

You may get a slight improvement if you write a recursive function that collects the content of every single node via nodeValue and then trim the result, but that will still take into account the non-printable text in some nodes.
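
A minimal sketch of such a recursive walk might look like the following. The names `collectText` and `countCharsWithSpaces` are made up for this example. It trims only the final string, on the assumption that trimming each node separately would also drop legitimate spaces at the boundaries between text runs.

```php
<?php
// Hypothetical helper: recursively concatenates the nodeValue of every
// text node below $node, in document order.
function collectText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE || $node->nodeType === XML_CDATA_SECTION_NODE) {
        return $node->nodeValue;
    }

    $text = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $text .= collectText($child);
        }
    }
    return $text;
}

// Hypothetical wrapper: parses the XML, collects the text, trims the
// leading/trailing whitespace once and counts UTF-8 characters.
function countCharsWithSpaces(string $xmlString): int
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        throw new InvalidArgumentException('Could not parse XML');
    }
    return mb_strlen(trim(collectText($dom->documentElement)), 'UTF-8');
}
```

Unlike `strip_tags($xml->saveXML())`, this never sees the XML declaration or escaped entities such as `&amp;`, which may account for part of a miscount, although it does not by itself guarantee the same number Word reports.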
