I have been asked about creating a site where some users can upload Microsoft Word documents and others can then search the uploaded documents for certain keywords. The site would sit on a Linux server running PHP and MySQL. I'm currently trying to find out whether, and how, I can scrape the text from the documents. If anyone can suggest a good way of doing this, it would be much appreciated.

Kara
Ultimate Gobblement
  • What versions of Word? The old .DOC format or the new XML-based ones, or both? Would using a headless OpenOffice instance on your server be an option? – Pekka Nov 24 '10 at 10:47
  • Ideally it should be able to handle whatever the users chuck at it, so any version of Word if possible. I have used OOo once in the past for converting docs to HTML, and that could be a good option; my main worry is that calling it from a server script may use up too many resources. – Ultimate Gobblement Nov 24 '10 at 10:55

2 Answers

Scraping text from the new docx format is trivial. The file itself is just a zip archive, and if you look inside one, you will find a bunch of XML files. The body text is contained in word/document.xml within this archive, and all the actual user-entered text appears in <w:t> tags. If you extract all the text that appears in <w:t> tags, you will have scraped the document body; headers, footers, and footnotes live in separate XML parts of the same archive.
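
For what it's worth, here is a minimal sketch of that approach in PHP, using the bundled ZipArchive and DOM extensions rather than a regex. The function name docx_to_text() is made up for illustration, and the sketch assumes the standard WordprocessingML namespace and reads only word/document.xml, so it is a starting point rather than a complete extractor.

```php
<?php
// Minimal sketch: pull the <w:t> text out of a .docx with ZipArchive + DOM.
// docx_to_text() is a hypothetical helper name, not a library function.
function docx_to_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false; // not a readable zip archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false; // a zip file, but no Word body part inside
    }

    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false; // malformed XML
    }

    $xpath = new DOMXPath($dom);
    // Assumption: the document uses the usual (transitional) namespace URI.
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    // One output line per non-empty paragraph (<w:p>); the text runs
    // (<w:t>) inside a paragraph are concatenated as they appear.
    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n", $paragraphs);
}
```

Calling docx_to_text('/path/to/upload.docx') (a placeholder path) returns the text as a string, or false if the file is not a zip archive, lacks word/document.xml, or contains XML that does not parse. Because a real XML parser does the work, entities and odd content inside the <w:t> elements are decoded for you.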

ZoFreX
  • Thanks for the explanation about docx. catdoc does not work with docx files, so I'm using a combination of catdoc and a little bash script that does what you described, found here: http://stackoverflow.com/questions/1184747/rtf-doc-docx-text-extraction-in-program-written-in-c-qt – Ultimate Gobblement Nov 25 '10 at 12:03
  • You might want to be wary of using such simple scripts to parse XML... parsing XML is very easy, but doing it with a bash script or regex could cause headaches if (for whatever reason) there's weird stuff floating around in those w:t tags. – ZoFreX Nov 26 '10 at 09:57

Here's a good example using catdoc:

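function catdoc_string($str)
{
    // requires catdoc (which does not handle .docx files)

    // write the document bytes to a temp file
    $tmpfname = tempnam('/tmp', 'doc');
    $handle = fopen($tmpfname, 'w');
    fwrite($handle, $str);
    fclose($handle);

    // run catdoc, folding stderr into the captured output
    $ret = (string) shell_exec('catdoc -ab ' . escapeshellarg($tmpfname) . ' 2>&1');

    // remove temp file
    unlink($tmpfname);

    // a "command not found" message from the shell means catdoc is missing
    // (the exact wording matched here depends on the shell)
    if (preg_match('/^sh: line 1: catdoc/i', $ret)) {
        return false;
    }

    return trim($ret);
}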

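function catdoc_file($fname)
{
    // requires catdoc (which does not handle .docx files)

    // run catdoc on the existing file, folding stderr into the captured output
    $ret = (string) shell_exec('catdoc -ab ' . escapeshellarg($fname) . ' 2>&1');

    // a "command not found" message from the shell means catdoc is missing
    // (the exact wording matched here depends on the shell)
    if (preg_match('/^sh: line 1: catdoc/i', $ret)) {
        return false;
    }

    return trim($ret);
}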

Source
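
Since catdoc only covers the old binary format, an upload handler that accepts "any version of Word" needs some way to decide which extractor to call. One possible way, sketched below, is to look at the first bytes of the file instead of trusting its extension: zip archives (and therefore .docx) start with "PK\x03\x04", and legacy .doc files are OLE2 compound files starting with D0 CF 11 E0 A1 B1 1A E1. The function name sniff_word_format() is hypothetical, and the check is only a hint: any zip file will be reported as 'docx' and any OLE2 file (old Excel or PowerPoint files, for example) as 'doc'.

```php
<?php
// Hypothetical helper: guess the Word container format from the leading bytes.
function sniff_word_format(string $path): string
{
    // Read at most 8 bytes; a missing or unreadable file yields ''.
    $head = (string) @file_get_contents($path, false, null, 0, 8);

    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        return 'docx'; // zip container (not necessarily a Word document)
    }
    if ($head === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';  // OLE2 compound file (not necessarily a Word document)
    }
    return 'unknown';
}
```

A caller could then pass 'doc' files to catdoc_file() and handle 'docx' files by reading word/document.xml from the archive, rejecting everything else.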

Ruel