Is there any way to get, online (meaning as part of an upload form, so in PHP/JavaScript), the number of characters including spaces in a document saved as DOCX or ODT (and RTF if possible)? I mean, to get the same character count as shown in Word's statistics.
I know that Word stores a <characters> value in its app.xml file, but that value is not precise and probably does not include spaces, or maybe I don't understand it well.
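For reference, a minimal sketch of reading those stored statistics, assuming the DOCX keeps them in docProps/app.xml as `Characters` and `CharactersWithSpaces` elements. The helper names `readStoredCounts` and `readStoredCountsFromDocx` are hypothetical, and the values are only whatever the application wrote at the last save, so they may be stale or missing.

```php
<?php
// Hypothetical helper: parses the XML text of docProps/app.xml and returns
// the stored counts (null for any element that is absent).
function readStoredCounts(string $appXml): array {
    $doc = new DOMDocument();
    $doc->loadXML($appXml);

    $counts = [];
    foreach (['Characters', 'CharactersWithSpaces'] as $name) {
        $nodes = $doc->getElementsByTagName($name);
        $counts[$name] = $nodes->length > 0 ? (int) $nodes->item(0)->textContent : null;
    }
    return $counts;
}

// Hypothetical wrapper: pulls docProps/app.xml out of the DOCX archive first.
function readStoredCountsFromDocx(string $filename): ?array {
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return null; // not a readable zip archive
    }
    $appXml = $zip->getFromName('docProps/app.xml');
    $zip->close();
    return $appXml === false ? null : readStoredCounts($appXml);
}

// Illustrative input only, not taken from a real document.
$sampleAppXml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">'
    . '<Characters>10</Characters><CharactersWithSpaces>12</CharactersWithSpaces>'
    . '</Properties>';

print_r(readStoredCounts($sampleAppXml));
```

Splitting the parsing from the archive access keeps the parsing testable without a real file; `readStoredCountsFromDocx('cvicnytext2.docx')` would be the call for an uploaded document.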
I've tried to do it simply: open the XML files, strip the tags and count the characters that remain. The problem is that this approach is not accurate either; see my code:
    $document = 'cvicnytext2.docx';

    function extracttext($filename) {
        // Check for extension
        $ext = explode(".", $filename);
        $ext = end($ext);

        // DOCX keeps the body text in word/document.xml; otherwise assume ODT
        if ($ext == 'docx')
            $dataFile = "word/document.xml";
        else
            $dataFile = "content.xml";

        $zip = new ZipArchive;

        // Open the archive file
        if (true === $zip->open($filename)) {
            if (($index = $zip->locateName($dataFile)) !== false) {
                $text = $zip->getFromIndex($index);
                // Close the archive before returning so the handle is not leaked
                $zip->close();
                $xml = new DOMDocument();
                $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
                // Strip all markup and keep only the text
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }

        return "File not found";
    }

    $length = strlen( utf8_decode( extracttext($document) ) );
    echo "Length: ". $length."(chars with spaces).";
If I upload, for example, this file, my code reports 76015 characters, but Word shows 76113, so about a hundred went missing somewhere.
Does anybody have any idea how to make it more precise? Your help will be appreciated.
Some more UPDATES
I've found that there is no big difference between the functions used for counting the length: mb_strlen( $text ) and strlen( utf8_decode( $text ) ).
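A small sketch of that comparison, assuming UTF-8 input and the mbstring extension. The sample string is made up for illustration, and `mb_convert_encoding()` stands in for `utf8_decode()` here because the latter is deprecated as of PHP 8.2.

```php
<?php
// Sample UTF-8 string with multi-byte characters (illustrative only).
$text = "Příliš žluťoučký kůň";

// strlen() counts bytes, so multi-byte characters are counted more than once.
$bytes = strlen($text);

// mb_strlen() counts Unicode code points.
$chars = mb_strlen($text, 'UTF-8');

// Same idea as strlen(utf8_decode(...)): after conversion to ISO-8859-1
// every code point occupies exactly one byte (unmappable ones become "?").
$viaLatin1 = strlen(mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8'));

echo "bytes: $bytes, mb_strlen: $chars, via Latin-1: $viaLatin1\n";
```

Both character-based counts agree with each other, while the plain byte count is larger, which is consistent with the choice of counting function not being the source of the discrepancy.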
But what probably causes the issue is the way the text is read from the zip file: it seems to add whitespace before and after the string, plus some characters that are not printed but are still counted. Any idea? If I copy/paste the same text directly into the counting functions, it works without trouble...
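One way to keep markup whitespace and encoded entities out of the count is to sum only the text runs instead of stripping tags from the whole XML. This is a minimal sketch, assuming the body text of a DOCX sits in `w:t` elements of word/document.xml; `countRunCharacters` is a hypothetical helper. It does not claim to reproduce Word's figure exactly, since tabs, line breaks, footnotes and text boxes are not handled.

```php
<?php
// Hypothetical helper: counts characters in the <w:t> text runs of a
// word/document.xml string, ignoring markup and whitespace between tags.
function countRunCharacters(string $documentXml): int {
    $doc = new DOMDocument();
    $doc->loadXML($documentXml);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $total = 0;
    foreach ($xpath->query('//w:t') as $run) {
        // textContent is entity-decoded, so "&amp;" counts as one character
        $total += mb_strlen($run->textContent, 'UTF-8');
    }
    return $total;
}

// Illustrative two-paragraph document with indentation between the tags.
$sample = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    . "<w:body>\n  <w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>\n"
    . "  <w:p><w:r><w:t xml:space=\"preserve\">žluťoučký </w:t></w:r></w:p>\n</w:body>"
    . '</w:document>';

echo countRunCharacters($sample), "\n";
```

In the sample, the newlines and indentation between elements are not counted, whereas a `strip_tags()` pass over the serialized XML would keep them along with the five-character `&amp;`.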