I am able to extract the text content of a .docx file, and I want to do the same for a .doc file. I tried the same code but could not read anything. I guess the reason is that .doc files are not zipped archives. Here is the code:

  function readDocx($filePath)
    {
        // Create new ZIP archive
        $zip = new ZipArchive;
        $dataFile = 'word/document.xml';

        // Open received archive file (open() returns true only on success)
        if (true === $zip->open($filePath)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);
                // Close archive file
                $zip->close();

                // Load XML from a string, skipping errors and warnings.
                // loadXML() is an instance method, so create the document first.
                // Entity substitution is left off because the file may be untrusted.
                $xml = new DOMDocument();
                if ($data === false || !$xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                    return "";
                }

                // Drop the XML tags and keep only the text
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }
        return "";
    }

Please let me know if there is a way to fetch the text content from a .doc file.

Shankar Narayana Damodaran
Faizan Ansari
  • No, it's not that simple: a .doc file is not an XML document but a "Word Binary Document". There are readers for PHP that can read them, but it's as complex as reading a PDF, so you would have to use a prebuilt library. See this post: http://stackoverflow.com/questions/7358637/reading-doc-file-in-php – TiMESPLiNTER Nov 12 '13 at 06:30
  • It's always nice to google first before posting a question. Most probably you're not the first to face such a problem... – Havelock Nov 12 '13 at 06:33
  • Thank you TiMESPLiNTER, I will check out some libraries. Thank you Havelock, I did Google it and could not find an exact solution. That's why I asked the question. Sometimes even similar questions don't get much visibility, and, most importantly, when you are in a hurry to nail something, you make such mistakes. – Faizan Ansari Nov 12 '13 at 06:41
  • @MohammedFaizanAnsari please do allow me to disagree. SO questions get a *very good* visibility. Even Google's "auto suggest" shows you that you're not the first to ask such a question ;-) – Havelock Nov 12 '13 at 06:49
  • @Havelock: I am still clueless. The PHPWord library can only create a Word file, not read one. No proper solution or suggestion is given anywhere. Nobody just comes and asks questions that were already asked; I asked because I failed to find a solution. Please help me if you know the proper solution. – Faizan Ansari Nov 13 '13 at 08:58

1 Answer

Well, I finally got the answer, so I thought I should share it here. I simply used COM objects (this requires Windows with Microsoft Word installed):

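$DocumentPath = "C:/xampp/htdocs/abcd.doc";

// Start Word through COM
$word = new COM("word.application") or die("Unable to instantiate application object");

$word->Visible = 0;

// Documents->Open() returns the document object, so no separate "word.document" instance is needed
$wordDocument = $word->Documents->Open($DocumentPath);

// Swap the "doc" extension for "html"
$HTMLPath = substr_replace($DocumentPath, 'html', -3, 3);

// Note: in Word's WdSaveFormat enumeration, 3 is wdFormatTextLineBreaks (plain text); wdFormatHTML is 8
$wordDocument->SaveAs($HTMLPath, 3);

// Close the document without saving changes before quitting Word
$wordDocument->Close(false);
$wordDocument = null;

$word->Quit();

$word = null;

// Output the converted file, then delete it
readfile($HTMLPath);

unlink($HTMLPath);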
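
Since the COM route only matters for the binary format, it may help to check which kind of file you have before choosing a reader. Below is a minimal sketch, assuming the usual file signatures: ZIP containers such as .docx begin with the bytes "PK\x03\x04", while OLE2 compound files such as legacy .doc begin with D0 CF 11 E0 A1 B1 1A E1. The function name `detectWordFormat` and its return values are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: guess the container format from the leading "magic" bytes.
// Returns 'zip', 'ole2', 'unknown', or 'unreadable'.
function detectWordFormat(string $filePath): string
{
    $handle = @fopen($filePath, 'rb');
    if ($handle === false) {
        return 'unreadable';
    }
    // Read the first 8 bytes; an empty or short file simply yields a shorter string
    $header = (string) fread($handle, 8);
    fclose($handle);

    // ZIP local file header signature (.docx is a ZIP archive)
    if (strncmp($header, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    // OLE2 compound file signature (legacy binary .doc)
    if ($header === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }
    return 'unknown';
}
```

A result of 'zip' suggests the ZipArchive approach from the question should work, while 'ole2' suggests the file needs COM or a dedicated .doc library. The signature only identifies the container, so other ZIP or OLE2 based files (for example .xlsx or .xls) would match as well.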
Faizan Ansari