You can try this PHP class, which claims to convert both .doc and .docx files to plain text.
http://www.phpclasses.org/package/7934-PHP-Convert-MS-Word-Docx-files-to-text.html
According to the example given, this is how you use it:
require("doc2txt.class.php");
$docObj = new Doc2Txt("test.docx");
//$docObj = new Doc2Txt("test.doc");
$txt = $docObj->convertToText();
echo $txt;
As you pointed out, the core function of this library, as in many others, is something like this:
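<?php
function read_doc($filename)
{
    // Open in binary mode so the bytes are read unmodified on every platform.
    $fileHandle = fopen($filename, "rb");
    if ($fileHandle === false) {
        return "";
    }
    $line = fread($fileHandle, filesize($filename));
    fclose($fileHandle);

    // Split the raw bytes on carriage returns (0x0D).
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks containing NUL bytes (binary data).
        if (strpos($thisline, chr(0x00)) !== false || strlen($thisline) == 0) {
            continue;
        }
        $outtext .= $thisline . " ";
    }
    // Drop every character outside a small whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/_()]/", "", $outtext);
    return $outtext;
}
echo read_doc("sample.doc");
?>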
I've tested this function with a .doc file and it seems to work quite well. The end of the output still contains some random text, so the last part of the document needs fixing, but with some fine-tuning it works reasonably.
EDIT:
You are right, this function works correctly only with .docx documents (the document I tested was probably made using the same mechanism). When a file is saved with the .doc extension, this function doesn't work!
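Since the extension alone can be misleading, one way to check which container a file really uses is to look at its first bytes. This is only a minimal sketch: `detect_word_format` is a hypothetical name, not part of the class above. Legacy .doc files are OLE2 compound files and .docx files are ZIP archives, but the same signatures also match other files (.xls and .ppt for OLE2, any archive for ZIP), so the result identifies the container, not the document type.

```php
<?php
// Hypothetical helper: classify a file by its leading "magic" bytes.
function detect_word_format(string $filename): string
{
    if (!is_readable($filename)) {
        return 'unreadable';
    }
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return 'unreadable';
    }
    $magic = (string) fread($handle, 8);
    fclose($handle);

    // OLE2 compound file signature (legacy .doc, but also .xls, .ppt).
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }
    // ZIP local file header signature (.docx, but also any other ZIP).
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    return 'unknown';
}
```

Calling `detect_word_format("sample.doc")` before choosing a parser avoids feeding a renamed file to the wrong routine.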
The only help I can give you right now is the .doc binary specifications link (here is an even more complete file), where you can see how the binary structure is laid out and extract the information from there. I can't do it now, so I hope somebody else can help you with this!
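As a starting point for reading the binary structure, here is a rough sketch that decodes a few fields from the header of the compound file that wraps a .doc. It is not a text extractor. `read_cfb_header` is a hypothetical name, and the offsets are my reading of the compound file header layout (signature at bytes 0-7, then five little-endian 16-bit fields starting at byte 24), so verify them against the specification before relying on them.

```php
<?php
// Hypothetical helper: decode a few header fields of an OLE2 compound file.
// Returns null when the file is unreadable or is not a compound file.
function read_cfb_header(string $filename): ?array
{
    // Only the first 512 bytes (the header) are needed.
    $data = @file_get_contents($filename, false, null, 0, 512);
    if ($data === false || strlen($data) < 34) {
        return null;
    }
    if (substr($data, 0, 8) !== "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return null;
    }
    // Bytes 24..33: minor version, major version, byte order,
    // sector shift, mini sector shift (each an unsigned 16-bit LE value).
    $f = unpack(
        'vminor/vmajor/vbyteOrder/vsectorShift/vminiSectorShift',
        substr($data, 24, 10)
    );
    return [
        'majorVersion'   => $f['major'],
        'byteOrder'      => $f['byteOrder'],
        'sectorSize'     => 1 << $f['sectorShift'],      // shift 9 gives 512
        'miniSectorSize' => 1 << $f['miniSectorShift'],  // shift 6 gives 64
    ];
}
```

From the sector size you can then follow the sector chains to locate the streams that hold the document text, which is the part the specification describes in detail.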