7

I'm working on a project and I'm stuck on reading Word documents.

Word file content:

This is a test word file in PHP.

Thank you.

PHP code:

    $myFile = "wordfile.docx";
    $fh = fopen($myFile, 'r');
    $theData = fread($fh, 1000);
    fclose($fh);
    echo $theData;

Output:

PK!éQ°Â[Content_Types].xml ¢( ´”MOÂ@†ï&þ‡f¯¦]ð`Œ¡pP<*‰Ïëv
 «Ýì,_ÿÞiI¡(ziÒNß÷}fÚÞ`©‹h•5)ë&‘6Sf’²×ñc|Ë"Âd¢°R¶dƒþåEo
 ¼r€© ¦l‚»ãå´ÀÄ:0TÉ­×"ЭŸp'䧘¿îtn¸´&€  q(=X¿÷¹˜!.éñ
 š„ä,º_¿WF¥L8W()ò²Êu <"œ›l.Þ%¤¬Ìqª^Nøp0ÙKPºl­*Õ3Ó
 «¢‘ðáIhbçë3žY9ÓÔwr¼¹F›çJB­/Ýœ·é;é"©+Z(³e?ÈaUþ=ÅÚ÷Ä
 ø7¦Ã<I?Hû<4ÆeÓÉ:bGÛž!ÐN    ùþÛÆmCÇs+ÂÞ_þbǼ$§ó4ïœ
 0ñ£¶n…´#€W×îٕͱH:#oÒÎñ¿h{»JuLGÎ êõÐtÄêDZXg÷åFÌ kÈæÕîÿÿPK
 !ÇÂ'¼ß_rel

Is there any way to read the Word document in PHP?

Othman
  • Possible duplicate of http://stackoverflow.com/questions/7144023/opening-word-document-with-read-mode-using-php –  May 18 '12 at 03:48
  • @Webtecher I tried it and got this error: `Fatal error: Class 'COM' not found` – Othman May 18 '12 at 03:53
  • There is a really great resource on reading word documents: http://stackoverflow.com/questions/188452/reading-writing-a-ms-word-file-in-php. – Brett May 18 '12 at 05:03

5 Answers

17

For docx files, use this function:

    function read_docx($filename)
    {
        if (!$filename || !file_exists($filename)) {
            return false;
        }

        // zip_open() returns a resource on success and an integer error code on failure.
        $zip = zip_open($filename);
        if (!is_resource($zip)) {
            return false;
        }

        $content = '';
        while ($zip_entry = zip_read($zip)) {
            // Only the main document part holds the body text; skip everything else
            // before opening it, so no entry is left open.
            if (zip_entry_name($zip_entry) != "word/document.xml") {
                continue;
            }
            if (zip_entry_open($zip, $zip_entry) == false) {
                continue;
            }

            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
            zip_entry_close($zip_entry);
        }
        zip_close($zip);

        // Separate table cells with a space and paragraphs with a line break
        // before stripping the remaining XML tags.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);

        return strip_tags($content);
    }

It returns the plain text of the docx file, or false if the file cannot be opened.

Sudhir
  • but the format is changing. What to do to keep the format same? – Rohan Gala Mar 19 '16 at 17:14
  • @RohanGala It reads the docx file and returns its content. Can you show me what format you get? – Sudhir Mar 21 '16 at 03:26
  • Format as in the spaces and blank lines are not shown. But the text is obtained correctly – Rohan Gala Mar 21 '16 at 08:38
  • This works good but skips the first few lines often for some reason – dlofrodloh Jul 15 '16 at 17:14
  • strip_tags() will remove all the xml that contain inline-style and or class; you would need those and interpret/apply these in some ways to restore the styling. – jrgd Feb 09 '19 at 09:26
  • Yeah this works pretty good and is really simple. From what I can see all the text is there. The only thing that's a bit of a pain is lots of numbers embedded, e.g. `-540385322897565151028479750031953205141595` . Probably some layout stuff but not a train smash. – Eugene van der Merwe May 21 '19 at 16:49
6

"PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats." (PHPOffice, 2016)

This open-source PHP library should solve your problem. You can either download it or install it with Composer:

https://github.com/PHPOffice/PHPWord

user2912903
3

The following function is similar to the one in @Sudhir's answer, but works on PHP 8:

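    function readDocx($filename)
    {
        $zip = new ZipArchive();
        // open() returns true on success and an integer error code on failure,
        // so compare strictly: a non-zero error code is also truthy.
        if ($zip->open($filename) !== true) {
            return false;
        }

        $content = $zip->getFromName("word/document.xml");
        $zip->close();
        if ($content === false) {
            return false;
        }

        // Separate table cells with a space and paragraphs with a line break.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);

        return strip_tags($content);
    }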

The procedural zip_* functions (zip_open(), zip_read() and so on) are deprecated as of PHP 8; the ZipArchive class is the replacement.

Marina DU
  • I am using PHP7 and got the deprecated warning for @Sudhir's answer. Also, I tried phpWord and it didn't work with my word files created by MS Word or Google Docs. This short code just worked for both. This should be marked as the answer, thank you. – hapablap Dec 04 '21 at 10:30
2

"docx" is different from "doc". Docx files are basically XML files in a ZIP container (as described by Wikipedia). Doc files are binary blobs.
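
One way to tell the two apart before choosing a reader is to look at the first bytes of the file. The sketch below is only an illustration: `detectWordFormat()` is a hypothetical helper name, and it assumes that a ZIP local file header ("PK\x03\x04") means a docx candidate and that the OLE2 compound file signature means a legacy doc candidate. Other ZIP or OLE2 files would match too, so treat the result as a hint, not a guarantee.

```php
<?php
// Hypothetical helper: guess the container format from the file's leading bytes.
// Returns 'zip' (docx candidate), 'ole2' (legacy doc candidate) or 'unknown'.
function detectWordFormat(string $path): string
{
    if (!is_readable($path)) {
        return 'unknown';
    }

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return 'unknown';
    }
    $magic = (string) fread($fh, 8);
    fclose($fh);

    // ZIP local file header signature, shown as "PK" in the question's output.
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }

    // OLE2 compound file signature used by binary .doc files.
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }

    return 'unknown';
}
```

Called as `detectWordFormat("wordfile.docx")`, this would be expected to return `'zip'` for the file in the question, since its output starts with "PK".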

I am aware of no library that can easily read docx files in PHP (although Phpdocx can write them). However, since these are just ZIP files containing XML files, you should be able to put something together using ZipArchive to open the docx container and DOMDocument, SimpleXML, XMLReader or XSLTProcessor to read the XML documents themselves.
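
A minimal sketch of that idea, using ZipArchive and DOMDocument, might look like the following. The function name `docxParagraphs()` and the constant `W_NS` are made up for this example, and it assumes the usual (transitional) WordprocessingML namespace; documents saved in the strict Open XML variant use a different namespace and would need that URI instead. It returns one string per `w:p` paragraph, so empty paragraphs come back as empty strings instead of disappearing.

```php
<?php
// Assumption: the transitional WordprocessingML namespace, as written by most Word versions.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: returns an array with the text of each paragraph,
// or false if the file cannot be read as a docx.
function docxParagraphs(string $filename)
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($filename) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        return false;
    }

    $paragraphs = [];
    // Every <w:p> is a paragraph; its visible text lives in <w:t> descendants.
    foreach ($dom->getElementsByTagNameNS(W_NS, 'p') as $p) {
        $text = '';
        foreach ($p->getElementsByTagNameNS(W_NS, 't') as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }

    return $paragraphs;
}
```

Something like `echo implode("\r\n", docxParagraphs("wordfile.docx"));` would then print the text one paragraph per line. Working on the parsed XML instead of `strip_tags()` also leaves room to inspect run properties later if formatting matters.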

Francis Avila
1

A Word document isn't stored as plain text (it's closer to an XML or binary file), so you can't just echo it and expect to get the human-readable portion of the docx file.

There's a library that could do what you want, but it only accepts doc files:

Docvert

Andreas Wong