
I've been using a very useful tool for reading Word documents, posted as the accepted answer here: How to extract text from word file .doc,docx,.xlsx,.pptx php

It works quite well, except that it sometimes omits the first few lines of text from .doc files.

Here is the function to read a .doc file:

private function read_doc() {
    $fileHandle = fopen($this->filename, "r");
    $line = @fread($fileHandle, filesize($this->filename));   
    $lines = explode(chr(0x0D),$line);
    $outtext = "";
    foreach($lines as $thisline)
      {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))
          {
          } else {
            $outtext .= $thisline." ";
          }
      }
     $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
    return $outtext;
}

It seems the issue is with this part:

$pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE)||(strlen($thisline)==0))

While this correctly removes the parts of the document which aren't the text content, it seems to sometimes be responsible for removing the first line of text content.
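
To illustrate what I think is happening, here is a minimal sketch that applies the same filter to a string instead of a file. The sample bytes are made up (a real .doc header is far more complex), and `drop_nul_lines` is a hypothetical name used only for this example. The assumption is that the last chunk of document coding and the first line of text sit in the same 0x0D-delimited segment, so the whole segment is discarded.

```php
<?php
// Same filter as read_doc(), applied to a string so it can run without a .doc file.
// drop_nul_lines() is a hypothetical helper, not part of the original class.
function drop_nul_lines(string $data): string
{
    $outtext = "";
    foreach (explode(chr(0x0D), $data) as $thisline) {
        // Skip empty lines and any line that contains a NUL byte.
        if (strpos($thisline, chr(0x00)) !== false || strlen($thisline) == 0) {
            continue;
        }
        $outtext .= $thisline . " ";
    }
    return $outtext;
}

// Made-up sample: binary header, a NUL byte, then the first line of text
// with no 0x0D between the header and the text.
$sample = "\x01\x02hdr\x00First line\rSecond line\rThird line";
echo drop_nul_lines($sample), "\n"; // "First line" is missing from the output
```

The first segment contains a NUL byte, so it is skipped in full, taking "First line" with it.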

How could this function be amended to avoid this problem when reading .doc files?

dlofrodloh
  • I'm not sure I understand the point of the `$pos` check. It seems to be attempting some stuff with null termination. What happens if you remove that line and simply check `if (strlen($thisline)==0)`? – Caius Jul 15 '16 at 22:46
  • It would seem all/most of the document coding has the 0x00 char in the line, so I assume the logic is that if that char exists in the line then skip it so you only output text content. If I replace the if statement with your suggestion, the document coding gets put in the output. I assume the problem is the 1st line of the text content sometimes has this character in it for whatever reason. – dlofrodloh Jul 16 '16 at 01:03

1 Answer


I came up with the following workaround, which seems to do the trick. I used strrpos instead of strpos to find the last occurrence of the 0x00 character in the line, because whatever follows it on that line is text content. If a line is the last bit of document coding before the content starts, the text part of that line is added to the output.

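private function read_doc() {
    // Open in binary mode so the bytes are read unchanged on every platform.
    $fileHandle = fopen($this->filename, "rb");
    if ($fileHandle === false) {
        return "";
    }
    $line = @fread($fileHandle, filesize($this->filename));
    fclose($fileHandle);
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    $content_started = false;
    // Initialise these so the first iteration never reads undefined variables.
    $lastline = "";
    $lastpos = false;
    foreach ($lines as $thisline) {
        // Position of the last NUL byte; a line containing one is treated as document coding.
        $pos = strrpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            if (!$content_started && $lastpos !== false) {
                // First clean line: recover the text after the last NUL byte of the previous line.
                $outtext .= substr($lastline, $lastpos + 1) . " ";
            }
            $content_started = true;
            $outtext .= $thisline . " ";
        }
        $lastline = $thisline;
        $lastpos = $pos;
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}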
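
To try the idea without a real .doc file, the same loop can be run on a string. This is only a sketch under stated assumptions: the sample bytes are made up, `extract_text` is a hypothetical standalone helper rather than part of the class, and it starts one byte after the NUL instead of relying on the final preg_replace to strip it.

```php
<?php
// Hypothetical standalone version of the loop in read_doc(), operating on a string.
function extract_text(string $data): string
{
    $outtext = "";
    $content_started = false;
    $lastline = "";
    $lastpos = false;
    foreach (explode(chr(0x0D), $data) as $thisline) {
        // Last NUL byte in this segment, or false if there is none.
        $pos = strrpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            if (!$content_started && $lastpos !== false) {
                // Recover the text that follows the last NUL of the previous segment.
                $outtext .= substr($lastline, $lastpos + 1) . " ";
            }
            $content_started = true;
            $outtext .= $thisline . " ";
        }
        $lastline = $thisline;
        $lastpos = $pos;
    }
    return $outtext;
}

// Made-up sample: binary header, a NUL byte, then text split on 0x0D.
$sample = "\x01\x02hdr\x00First line\rSecond line\rThird line";
echo extract_text($sample), "\n";
```

Note that this only recovers text from the segment immediately before the first clean line. Any later segment that contains a NUL byte is still skipped.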
dlofrodloh