2

I have tried many things, such as the approaches in "How to extract text from word file .doc,docx,.xlsx,.pptx php", but none of them solved my problem.

My server is Linux-based, so enabling extension=php_com_dotnet.dll is not an option.

Another solution was to install LibreOffice on the server, convert the .doc file to .txt on the fly, and then count the words in that file. This is tedious and time-consuming.

I just need a simple PHP script that removes the special characters from the .doc file and counts the number of words.

Sumit Nayak
  • .doc is almost unparsable by any software that isn't Microsoft Word. I'd recommend using some other file format if at all possible. – GordonM Jul 09 '14 at 12:50
  • I am developing a WordPress plugin to count the number of words in a file. That also includes .doc files, since, as you know, it is a widely used extension. So I can't ignore .doc files. – Sumit Nayak Jul 09 '14 at 13:18

3 Answers

3

You can try this PHP class, which claims to convert both .doc and .docx files to plain text.

http://www.phpclasses.org/package/7934-PHP-Convert-MS-Word-Docx-files-to-text.html

According to the example given, this is how you use it:

require("doc2txt.class.php");

$docObj = new Doc2Txt("test.docx");
//$docObj = new Doc2Txt("test.doc");

$txt = $docObj->convertToText();
echo $txt;

As you pointed out, the core function of this library, as in many others, is something like this:

<?php

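 function read_doc($filename)
 {
    // Read the whole file as raw bytes; return an empty string if it cannot be read.
    $data = @file_get_contents($filename);
    if ($data === false) {
        return "";
    }

    // Split on carriage returns (0x0D), which .doc uses as paragraph marks.
    $lines = explode(chr(0x0D), $data);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks containing NUL bytes (treated as binary data).
        if (strlen($thisline) == 0 || strpos($thisline, chr(0x00)) !== false) {
            continue;
        }
        $outtext .= $thisline . " ";
    }

    // Keep only letters, digits, whitespace and some basic punctuation.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/_()]/", "", $outtext);
    return $outtext;
 }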

 echo read_doc("sample.doc");

?>

I've tested this function with a .doc file and it seems to work quite well. It needs some fixes for the last part of the document (some random text still appears at the end of the output), but with some fine-tuning it works reasonably well.
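Once the text has been extracted, counting the words is the easy part. Here is a minimal sketch, assuming that a "word" is any whitespace-separated token containing at least one letter or digit. The function name count_words_in_text is made up for illustration and is not part of the class linked above.

```php
<?php

// Hypothetical helper (not part of the linked class): counts the words
// in text that has already been extracted from a document.
function count_words_in_text($text)
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Assumption: a token only counts as a word if it contains at least
    // one ASCII letter or digit, so a stray "-" is ignored.
    $words = array_filter($tokens, function ($token) {
        return preg_match('/[a-zA-Z0-9]/', $token) === 1;
    });

    return count($words);
}
```

For example, count_words_in_text("Hello, world - this is a test.") returns 6, because the lone "-" is not counted. Text in non-Latin scripts would need a different character class.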

EDIT: You are right, this function works correctly only with .docx documents (the document I tested was probably made using the same mechanism). When a file is saved with the .doc extension, this function doesn't work! The only help I can give you right now is the .doc binary specifications link (here is an even more complete file), where you can see how the binary structure is laid out and extract the information from there. I can't do it now, so I hope somebody else can help you with this!
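Since the file extension does not always match the real format, it may help to check the first bytes of the file before choosing a parser. This is only a sketch: legacy .doc files are OLE2 compound files and .docx files are ZIP archives, but other formats share the same containers (for example .xls and any other ZIP file), so treat the result as a hint. Both function names are made up for illustration.

```php
<?php

// Hypothetical helper: guesses the container format from the leading bytes.
function detect_word_format($header)
{
    // OLE2 compound files (legacy .doc, but also .xls and .ppt) start
    // with this 8-byte signature.
    if (strncmp($header, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return "doc";
    }
    // ZIP archives (.docx, but also any other ZIP file) start with "PK\x03\x04".
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return "docx";
    }
    return "unknown";
}

// Hypothetical wrapper: reads the first 8 bytes of a file and classifies them.
function detect_word_format_of_file($filename)
{
    $handle = @fopen($filename, "rb");
    if ($handle === false) {
        return "unknown";
    }
    $header = fread($handle, 8);
    fclose($handle);
    return detect_word_format($header === false ? "" : $header);
}
```

A file that was renamed from .docx to .doc would still be reported as "docx" here, which could explain why a reader appears to work on some ".doc" files and not on others.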

clami219
  • Yes, I have tested this. I have seen the same function posted all over Stack Overflow, but it isn't working, mate. I'm able to count words from a .docx file, but **.doc** is the main problem here. – Sumit Nayak Jul 09 '14 at 13:02
  • It's much longer than I thought, but it's not impossible! Check my last edit! – clami219 Jul 09 '14 at 13:28
  • I hope I can try this too and write a function. Thanks for the help, mate. If I succeed in writing the function, I will let you know. – Sumit Nayak Jul 09 '14 at 13:33
  • @clami219 How did you do the finetuning to get rid of the random text at the end of the output for read_doc()? – tholu Aug 13 '15 at 12:33
  • @tholu I actually didn't. Sorry! :( – clami219 Aug 14 '15 at 14:17
2

In the end I had to use LibreOffice, and it turned out to be very efficient. It solved all my problems.

So my advice would be to install the headless package of LibreOffice on the server and use its command-line conversion.
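For anyone who wants to call it from PHP, here is a rough sketch of what this can look like. It assumes that the soffice binary is on the PATH, that exec() is not disabled on the host, and that the txt:Text export filter is available in the installed LibreOffice version. The function names are made up for illustration, and this is not necessarily the exact setup I used.

```php
<?php

// Hypothetical helper: builds the LibreOffice command line.
// Assumes the "soffice" binary is on the PATH.
function build_soffice_command($docPath, $outDir)
{
    return "soffice --headless --convert-to txt:Text --outdir "
        . escapeshellarg($outDir) . " " . escapeshellarg($docPath);
}

// Hypothetical helper: converts the document to plain text and returns
// the number of whitespace-separated words, or false on failure.
function count_words_via_libreoffice($docPath)
{
    $outDir = sys_get_temp_dir();
    exec(build_soffice_command($docPath, $outDir), $output, $status);

    // LibreOffice writes <original base name>.txt into the output directory.
    $txtPath = $outDir . "/" . pathinfo($docPath, PATHINFO_FILENAME) . ".txt";
    if ($status !== 0 || !is_file($txtPath)) {
        return false;
    }

    $text = file_get_contents($txtPath);
    unlink($txtPath); // remove the temporary text file
    if ($text === false) {
        return false;
    }

    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}
```

Usage would be something like count_words_via_libreoffice("/path/to/sample.doc"). Starting LibreOffice for every file has a noticeable startup cost, so for many uploads it may be worth converting in batches.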

Sumit Nayak
2

I've built a tool that combines various methods found around the web and on Stack Overflow to provide word, line and page counts for doc, docx, pdf and txt files. I hope it's of use to people. If anyone can get rtf working with it, I'd love a pull request! https://github.com/joeblurton/doccounter

mimsy