I've been using a very useful tool for reading Word documents, submitted as the accepted answer here: How to extract text from word file .doc,docx,.xlsx,.pptx php
It works quite well, except that it sometimes omits the first few lines of text from .doc files.
Here is the function to read a .doc file:
private function read_doc() {
    $fileHandle = fopen($this->filename, "r");
    $line = @fread($fileHandle, filesize($this->filename));
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline)
    {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
        {
        } else {
            $outtext .= $thisline." ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
    return $outtext;
}
It seems the issue is with this part:
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
While this correctly removes the parts of the document that aren't text content, it also seems to be what sometimes removes the first line of the text content.
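My guess at the mechanism, which I have not confirmed against the .doc format itself: the first paragraph has no 0x0D in front of it, so after explode() it shares a chunk with the binary data that precedes it, and that chunk is thrown away because it contains null bytes. The snippet below reproduces this with made-up data; the $raw string is a hypothetical stand-in for a real file, not actual .doc contents.

```php
<?php
// Hypothetical data: fake header bytes (including nulls) run straight into
// the first paragraph, because no 0x0D separates them.
$raw = "\xD0\xCF\x11\xE0\x00\x00First line\rSecond line\r";

$kept = [];
foreach (explode(chr(0x0D), $raw) as $chunk) {
    // Same test as in read_doc(): skip empty chunks and chunks with a null byte.
    if (strpos($chunk, chr(0x00)) !== false || strlen($chunk) == 0) {
        continue;
    }
    $kept[] = $chunk;
}
// "First line" is lost along with the fake header; only "Second line" is kept.
```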
How could this function be amended to avoid this problem when reading .doc files?
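One direction I am considering is sketched below. It assumes the first paragraph is lost because it sits in the same chunk as preceding binary data, so when the first clean chunk is found it goes back to the chunk just before it and keeps whatever follows the last null byte. The name extract_doc_text is hypothetical, and it takes the raw file contents as a string rather than reading $this->filename, so it can be tried without a file.

```php
<?php
// Sketch only: assumes the missing first paragraph is the tail of the
// null-containing chunk immediately before the first clean chunk.
function extract_doc_text(string $raw): string
{
    $outtext = "";
    $previous = "";      // chunk seen just before the current one
    $foundText = false;  // becomes true at the first clean chunk

    foreach (explode(chr(0x0D), $raw) as $chunk) {
        $isClean = strlen($chunk) > 0 && strpos($chunk, chr(0x00)) === false;
        if ($isClean) {
            if (!$foundText) {
                // Salvage the text after the last null byte of the previous chunk.
                $nul = strrpos($previous, chr(0x00));
                if ($nul !== false) {
                    $tail = (string) substr($previous, $nul + 1);
                    if (strlen($tail) > 0) {
                        $outtext .= $tail . " ";
                    }
                }
                $foundText = true;
            }
            $outtext .= $chunk . " ";
        }
        $previous = $chunk;
    }

    // Same character whitelist as the original function.
    return preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', '', $outtext) ?? "";
}
```

It could be called as extract_doc_text(file_get_contents($this->filename)). I expect it to have limits: if a clean-looking chunk happens to occur inside the binary header, the salvage step fires too early, and a document with only one paragraph may never produce a clean chunk at all.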