
I'm building a web app in PHP and I need to count the words in an uploaded .doc or .docx file. So far I'm using the functions below to count the words, but this code does not work for Greek characters.

For .doc:

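 public static function docWordCount($file){
  // Read the whole file as raw bytes; a .doc file is binary, not plain text.
  $fileHandle = fopen($file, "rb");
  $line = @fread($fileHandle, filesize($file));
  fclose($fileHandle);
  // Split on carriage returns (0x0D).
  $lines = explode(chr(0x0D),$line);
  $outtext = "";
  foreach($lines as $thisline)
    {
      // Keep only non-empty chunks that contain no NUL byte.
      $pos = strpos($thisline, chr(0x00));
      if (($pos === FALSE)&&(strlen($thisline)>0))
        {
          $outtext .= $thisline." ";
        }
    }
   // Drop every character outside this ASCII whitelist.
   $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
  return str_word_count($outtext);
 }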

And for .docx:

  public static function docxWordCount($file){ 
    $striped_content = '';
    $content = '';

    $zip = zip_open($file);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return str_word_count($striped_content);   
  }
man_or_astroman

1 Answer


str_word_count does not seem to be multibyte-safe: it decides what a letter is from single bytes and the current locale, so it miscounts UTF-8 text such as Greek. Your best bet is to use preg_split to split the text on non-word characters, using the \P{L} Unicode property together with the u modifier. For example, the following call splits your text at every non-letter character:

preg_split('/\P{L}/usi', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
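
To turn that split into an actual word count, a minimal sketch could look like the following. The function name unicodeWordCount is hypothetical (it is not from the question), and the sketch assumes the input is already a UTF-8 string and that a "word" is simply a run of Unicode letters.

```php
<?php
/**
 * Count words in a UTF-8 string by splitting on runs of non-letters.
 * Hypothetical helper name; assumes $text is valid UTF-8.
 */
function unicodeWordCount(string $text): int
{
    // \P{L}+ matches one or more characters that are NOT Unicode letters;
    // the "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $words = preg_split('/\P{L}+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. when $text is not valid
    // UTF-8. This sketch treats that case as "no words".
    return $words === false ? 0 : count($words);
}

// Usage: mixed Greek and Latin text.
echo unicodeWordCount('γεια σου, hello world'), PHP_EOL;
```

Note that this definition differs from str_word_count in a few ways: digits are not counted as words, and apostrophes or hyphens split a word in two. If that matters, the character class would need to be widened.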

See Unicode character properties for more information.
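
For the .docx case, word/document.xml inside the archive is normally UTF-8 XML, so the same split can be applied once the tags are stripped. The following is only a sketch under a few assumptions: the zip extension (ZipArchive) is available, the helper name docxUnicodeWordCount is made up for illustration, and text in headers, footers or footnotes (which live in other parts of the archive) is ignored. It does not help with binary .doc files, which need a real converter first.

```php
<?php
/**
 * Hypothetical helper: Unicode-aware word count for a .docx file.
 * Returns false if the archive or word/document.xml cannot be read.
 */
function docxUnicodeWordCount(string $file)
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($file) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    // Put whitespace at paragraph ends so adjacent paragraphs do not fuse
    // into one word once the tags are gone.
    $xml = str_replace('</w:p>', "\n", $xml);

    // strip_tags() leaves XML entities such as &amp; in place; decode them
    // so that "amp" is not counted as a word.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Split on runs of non-letters, UTF-8 aware (same idea as above).
    $words = preg_split('/\P{L}+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    return $words === false ? false : count($words);
}
```

A call such as docxUnicodeWordCount($uploadedPath) would then return an integer, or false on failure; $uploadedPath here stands for wherever your upload handler stored the file.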

periklis
  • My problem is not only the word count but also the conversion from .doc to text. When I upload a Greek doc, $outtext looks like this: Y, dXiJ(x(I_TS1EZBmU/xYy5g/GMGeD3Vqq8K)fw9 xrxwrTZaGy8IjbRcXI – man_or_astroman Feb 16 '15 at 12:50
  • Word documents are not text files. You need to convert them first. See http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php – periklis Feb 16 '15 at 12:58
  • Thank you for the help, but please read my comment: the code I've posted first converts the .doc to text and then counts the words. – man_or_astroman Feb 16 '15 at 13:43
  • @man_or_astroman Sadly, this doesn't work with UTF-8 non-Latin words: http://sandbox.onlinephpfunctions.com/code/28621ab857d77df14638f8ecfbfe19e855ad4822 – fat_mike Nov 17 '18 at 17:18