PHP to get text and images from docx

Question

I am parsing a docx file with PHP to extract its text and images in document order, using the following code:

    $zip = zip_open($filename);
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $zipEntryName = zip_entry_name($zip_entry);
        /*if(preg_match("([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)",$zipEntryName))
        {
            echo zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        }*/
        if (strpos($zipEntryName, 'word/media') !== false)
        {
            # Removes 'word/media' prefix
            $imageName = substr($zipEntryName, 11);

            # Prevent EMF file extensions passing, as they are used by word rather than being manually placed
            if (substr($imageName, -3) == 'emf') continue;

            # Place the image assets into an array for future reference
            $imageAssets[$imageName] = array(
                'h' => 'auto',
                'w' => 'auto',
                'title' => $imageName,
                'id' => null,
                'data' => base64_encode(zip_entry_read($zip_entry, zip_entry_filesize($zip_entry))));
        }

        if ($zipEntryName != "word/document.xml") continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }
    zip_close($zip);
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $content = str_replace("\r\n", "\n", $content);
    $striped_content = strip_tags($content);

I am storing the image files in the `$imageAssets` array. The stripped content contains the entire text, but each image appears in it as a seemingly random number. How do I map this number to the correct image?

[How Much Research Effort is Expected of Stack Overflow Users](https://meta.stackoverflow.com/a/261593/5827005) — GrumpyCrouton, Aug 14 '17 at 13:23
Look at this solution https://stackoverflow.com/questions/19503653/how-to-extract-text-from-word-file-doc-docx-xlsx-pptx-php — helmis.m, Aug 14 '17 at 16:40

score 0 · Answer 1 · answered Sep 18 '17 at 11:11

**Try this code:**

    # ZipArchive handle, used only to look up each entry's index for display.php
    $zip2 = new ZipArchive;
    if ($zip2->open($filename) !== true) return false;

    $zip = zip_open($filename);
    if (!$zip || is_numeric($zip)) return false;

    $i = 0;
    $content = '';
    $imageAssets = array();

    while ($zip_entry = zip_read($zip)) {

        # Count every entry so the counter stays aligned with the ZipArchive index
        # (this assumes zip_read() walks the entries in the same order)
        $entryIndex = $i++;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $zipEntryName = zip_entry_name($zip_entry);

        if (preg_match("([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)", $zipEntryName))
        {
            $zip_element = $zip2->statIndex($entryIndex);
            $index = $zip_element['index'];
            echo "<img src='display.php?filename=" . urlencode($filename) . "&index=" . $index . "'><br />";
        }

        if (strpos($zipEntryName, 'word/media') !== false)
        {
            # Removes 'word/media/' prefix
            $imageName = substr($zipEntryName, 11);

            # Skip EMF files, as they are used by Word rather than being manually placed
            if (substr($imageName, -3) != 'emf')
            {
                # Place the image assets into an array for future reference
                $imageAssets[$imageName] = array(
                    'h' => 'auto',
                    'w' => 'auto',
                    'title' => $imageName,
                    'id' => null,
                    'data' => base64_encode(zip_entry_read($zip_entry, zip_entry_filesize($zip_entry))));
            }
        }

        if ($zipEntryName == "word/document.xml")
        {
            $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        }

        # Close every opened entry, not only word/document.xml
        zip_entry_close($zip_entry);
    }

    zip_close($zip);
    $zip2->close();
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $content = str_replace("\r\n", "\n", $content);
    $striped_content = strip_tags($content);

Then add a new file, display.php, in the same folder to serve the images:

    <?php

    /* Tell the browser that we want to display an image.
       Note: this is hard-coded to JPEG, although the archive may also hold PNG, GIF or BMP files. */
    header('Content-Type: image/jpeg');

    /* Create a new ZIP archive object */
    $zip = new ZipArchive;

    /* Open the received archive file */
    if (true === $zip->open($_GET['filename'])) {

        /* Get the content of the specified index of the ZIP archive */
        echo $zip->getFromIndex((int) $_GET['index']);

        /* Close only an archive that was actually opened */
        $zip->close();
    }
    ?>
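
The code above displays the images but does not tie each one to its position in the text, which is what the question asks for. One possible approach, sketched here under assumptions rather than taken from the original answer: in a docx, an inline picture is normally referenced from `word/document.xml` by an `<a:blip r:embed="rIdN">` element, and `word/_rels/document.xml.rels` maps that relationship ID to a file under `word/media/`. The function names `mapImageRelationships`, `extractTextWithImages` and `readDocx` are hypothetical, and the sketch assumes relative `media/...` targets and ignores legacy VML images and text boxes.

```php
<?php
// Hypothetical helpers: a sketch, not part of the original answer.

const NS_W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
const NS_A = 'http://schemas.openxmlformats.org/drawingml/2006/main';
const NS_R = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';

/** Map relationship ID => image file name from word/_rels/document.xml.rels. */
function mapImageRelationships(string $relsXml): array
{
    $map = array();
    $rels = new DOMDocument();
    $rels->loadXML($relsXml);
    foreach ($rels->getElementsByTagName('Relationship') as $rel) {
        // Image relationships have a Type URI ending in "/image"
        if (substr($rel->getAttribute('Type'), -6) === '/image') {
            // basename() yields e.g. "image1.png", matching the $imageAssets keys
            $map[$rel->getAttribute('Id')] = basename($rel->getAttribute('Target'));
        }
    }
    return $map;
}

/** Return the text of word/document.xml, one line per paragraph, with an
 *  [image:name] placeholder wherever a picture is embedded. */
function extractTextWithImages(string $documentXml, array $relMap): string
{
    $doc = new DOMDocument();
    $doc->loadXML($documentXml);
    $xp = new DOMXPath($doc);
    $xp->registerNamespace('w', NS_W);
    $xp->registerNamespace('a', NS_A);

    $lines = array();
    foreach ($xp->query('//w:p') as $p) {
        $line = '';
        // Text runs and picture references of this paragraph, in document order
        foreach ($xp->query('.//w:t | .//a:blip', $p) as $node) {
            if ($node->localName === 'blip') {
                $id = $node->getAttributeNS(NS_R, 'embed');
                // Fall back to the raw ID if the relationship is unknown
                $line .= '[image:' . ($relMap[$id] ?? $id) . ']';
            } else {
                $line .= $node->textContent;
            }
        }
        $lines[] = $line;
    }
    return implode("\n", $lines);
}

/** Read both parts from the archive; returns false if the file cannot be read. */
function readDocx(string $filename)
{
    $zip = new ZipArchive;
    if ($zip->open($filename) !== true) return false;
    $relsXml = $zip->getFromName('word/_rels/document.xml.rels');
    $documentXml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($relsXml === false || $documentXml === false) return false;
    return extractTextWithImages($documentXml, mapImageRelationships($relsXml));
}
```

Each `[image:image1.png]` placeholder in the returned string can then be looked up as `$imageAssets['image1.png']`. Walking the XML this way also avoids `strip_tags()`, which keeps the numeric contents of the drawing markup; that is probably where the stray numbers in the stripped text come from.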
