14

I'm trying to read .doc and .docx files in PHP. Everything works, but the last line of the output contains garbage characters. Please help me. Here is the code, which was written by someone else.

    function parseWord($userDoc)
    {
        $fileHandle = fopen($userDoc, "r");
        $line = @fread($fileHandle, filesize($userDoc));
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE) || (strlen($thisline) == 0)) {
            } else {
                $outtext .= $thisline . " ";
            }
        }
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }

    $userDoc = "k.doc";

Here is a screenshot.

no_freedom

6 Answers

15

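A .docx file is a ZIP archive whose main text is stored in word/document.xml, so it can be read with PHP's zip functions. The older binary .doc format cannot be read this way. Here is the code to read .docx files: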

function read_file_docx($filename){

    $striped_content = '';
    $content = '';

    if(!$filename || !file_exists($filename)) return false;

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}
$filename = "filepath";// or /var/www/html/file.docx

$content = read_file_docx($filename);
if($content !== false) {

    echo nl2br($content);
}
else {
    echo 'Couldn\'t read the file. Please check that file.';
}
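
The zip_open() family of functions used above is deprecated as of PHP 8.0. Here is a minimal sketch of the same extraction with the ZipArchive class. It assumes the zip extension is enabled, and the names readDocxText and documentXmlToText are made up for illustration, not part of any library.

```php
<?php
// Hypothetical helper: turns the raw XML of word/document.xml into plain text,
// using the same string replacements as the function above.
function documentXmlToText(string $xml): string
{
    // A boundary between table cells becomes a space
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    // The end of a paragraph becomes a line break
    $xml = str_replace('</w:r></w:p>', "\r\n", $xml);
    // Drop all remaining markup
    return strip_tags($xml);
}

// Hypothetical helper: returns the text of a .docx file, or false on failure.
function readDocxText(string $filename)
{
    if (!is_file($filename)) {
        return false;
    }
    $zip = new ZipArchive();
    // ZipArchive::open() returns true on success and an error code otherwise
    if ($zip->open($filename) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? false : documentXmlToText($xml);
}
```

Like the original, this keeps XML entities such as &amp;amp; encoded and ignores images, headers and footnotes.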
Echilon
user1817444
  • Welcome to SO. It is good practice here to explain why to use your solution and not just how. That will make your answer more valuable and help future readers better understand how it works. I also suggest that you have a look at our FAQ: http://stackoverflow.com/faq. – ForceMagic Nov 12 '12 at 08:08
  • Thank you for your answer, but how do I write to that file? – Nitish Pareek May 06 '13 at 07:24
  • @user1817444 It only reads the text from the doc. How can I get the images as well? Images as binary data would also work. – Sanuj Sep 17 '16 at 11:13
8

DOC files are not plain text.

Try a library such as PHPWord (old CodePlex site).

nb: This answer has been updated multiple times as PHPWord has changed hosting and functionality.

Steve-o
4

I use this function and it works well for me :) Try it:

function read_doc_file($filename) {
    if (!file_exists($filename)) {
        return false;
    }
    // Open in binary mode: a .doc file is not plain text
    if (($fh = fopen($filename, 'rb')) === false) {
        return false;
    }

    // Read the first 0xA00 bytes; the text is expected to start right after them
    $headers = fread($fh, 0xA00);
    if ($headers === false || strlen($headers) < 0x220) {
        fclose($fh);
        return false;
    }

    // Bytes 0x21C to 0x21F are combined, least significant byte first,
    // into the total length of the text in the document
    $n1 = ord($headers[0x21C]) - 1;                // 0 to 255 characters
    $n2 = (ord($headers[0x21D]) - 8) * 256;        // 256 to 63743 characters
    $n3 = ord($headers[0x21E]) * 256 * 256;        // 63744 to 16775423 characters
    $n4 = ord($headers[0x21F]) * 256 * 256 * 256;  // 16775424 to 4294965504 characters
    $textLength = $n1 + $n2 + $n3 + $n4;
    if ($textLength <= 0) {
        fclose($fh);
        return false;
    }

    $extracted_plaintext = fread($fh, $textLength);
    fclose($fh);

    // nl2br() shows each paragraph on a new line in HTML output;
    // return $extracted_plaintext directly for the raw character stream
    return nl2br($extracted_plaintext);
}
  • This function works for reading the doc file, but I guess only for UTF-encoded files. Can you please tell me why other encodings do not work? I tried to read several files with this function and it does not work for all of them. The only difference I see is the encoding. – Ujjwal Prajapati Nov 06 '13 at 05:53
  • see my answer below for encoding issues – xchiltonx Feb 15 '15 at 01:28
  • After half an hour of searching for a simple way to read a doc file, I finally found this answer, which solved my problem. – rahul Jan 05 '16 at 13:41
3

Decoding in pure PHP never worked for me, so here is my solution: http://wvware.sourceforge.net/

Install package

sudo apt-get install wv elinks

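Use it in PHP:

// Write the text to a temporary file next to the source document
$output = $filename . '.txt';
// Escape both paths so spaces or shell metacharacters cannot break the command
shell_exec('/usr/bin/wvText ' . escapeshellarg($filename) . ' ' . escapeshellarg($output));
$text = file_get_contents($output);
# Convert to UTF-8 if needed
if(!mb_detect_encoding($text, 'UTF-8', true))
{
    // utf8_encode() is deprecated as of PHP 8.2; this is the equivalent conversion
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
unlink($output);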
hugsbrugs
1

I also used that function, but for accented characters (and single quotes like ' ) it would output � instead, so my PDO MySQL insert didn't like it. I finally fixed it by adding:

mb_convert_encoding($extracted_plaintext,'UTF-8');

So the final version should read:

function getRawWordText($filename) {
    if(file_exists($filename)) {
        if(($fh = fopen($filename, 'r')) !== false ) {
            $headers = fread($fh, 0xA00);
            $n1 = ( ord($headers[0x21C]) - 1 );// 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );// 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );// 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );// 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $textLength = ($n1 + $n2 + $n3 + $n4);// Total length of text in the document
            $extracted_plaintext = fread($fh, $textLength);
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext,'UTF-8');
             // if you want to see your paragraphs in a new line, do this
             // return nl2br($extracted_plaintext);
             return ($extracted_plaintext);
        } else {
            return false;
        }
    } else {
        return false;
    }  
}

This works fine with a utf8_general_ci MySQL database for reading Word .doc files :)

Hope this helps someone else
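
When mb_convert_encoding() is called without a source encoding, it takes the input to be in PHP's internal encoding, so bytes that are invalid there are replaced rather than translated. If the extracted text is actually in a legacy single-byte encoding, naming that encoding keeps the accents. Here is a minimal sketch, assuming the text is Windows-1252 (an assumption that will not hold for every .doc file). The helper name legacyTextToUtf8 is made up for illustration.

```php
<?php
// Hypothetical helper: converts extracted bytes to UTF-8.
// ASSUMPTION: input that is not already valid UTF-8 is Windows-1252.
function legacyTextToUtf8(string $bytes, string $from = 'Windows-1252'): string
{
    // Leave text that is already valid UTF-8 untouched
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Translate each legacy byte to its UTF-8 equivalent
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

// Usage: "\xE9" is "é" and "\x92" is a curly apostrophe in Windows-1252
echo legacyTextToUtf8("caf\xE9 it\x92s"), "\n";
```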

xchiltonx
1

I use soffice (the LibreOffice command-line binary) to convert the .doc file to .txt and then read the converted text file:

soffice --convert-to txt test.doc

You can see more here.
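
Calling the converter from PHP could look roughly like the sketch below. It assumes soffice is on the PATH and adds the --headless and --outdir options, which the command above does not use. The helper names buildSofficeCommand and convertedTextPath are hypothetical.

```php
<?php
// Hypothetical helper: builds the shell command with both paths escaped.
function buildSofficeCommand(string $docPath, string $outDir): string
{
    return 'soffice --headless --convert-to txt --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: path of the converted file (same base name, .txt extension).
function convertedTextPath(string $docPath, string $outDir): string
{
    return rtrim($outDir, '/') . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
}

// Usage (commented out because it needs LibreOffice installed):
// shell_exec(buildSofficeCommand('test.doc', '/tmp/out'));
// $text = file_get_contents(convertedTextPath('test.doc', '/tmp/out'));
```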

Kratos.vn