I have been asked about creating a site where some users can upload Microsoft Word documents and others can then search for uploaded documents that contain certain keywords. The site would run on a Linux server with PHP and MySQL. I'm currently trying to find out whether, and how, I can extract the text from the documents. If anyone can suggest a good way of doing this, it would be much appreciated.
-
What versions of Word? The old .DOC format or the new XML-based ones, or both? Would using a headless OpenOffice instance on your server be an option? – Pekka Nov 24 '10 at 10:47
-
Ideally it should be able to handle whatever the users chuck at it, so any version of Word if possible. I have used OOo once in the past for converting docs to HTML, and that could be a good option, but my main worry is that calling it from a server script may use up too many resources. – Ultimate Gobblement Nov 24 '10 at 10:55
2 Answers
Scraping text from the new docx format is trivial. The file itself is just a zip file, and if you look inside one, you will find a bunch of xml files. The text is contained in word/document.xml within this zip file, and all the actual user-entered text will appear in <w:t> tags. If you extract all text that appears in <w:t> tags, you will have scraped the document.
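
As a rough illustration of the steps above, here is a minimal PHP sketch that opens the zip, reads word/document.xml, and collects the text of every <w:t> element, one line per paragraph. The function name docx_extract_text is made up for this example, and it assumes the zip and dom extensions are enabled and that the file uses the usual (transitional) WordprocessingML namespace.

```php
<?php
// Hypothetical helper: returns the plain text of a .docx file, or false on failure.
function docx_extract_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false; // not a readable zip archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false; // a zip file, but not a Word document
    }

    // Parse with a real XML parser instead of a regex.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return false;
    }

    // Assumption: transitional OOXML namespace, as written by most Word versions.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $paragraphs = [];
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $p) {
        $text = '';
        // Runs can split a word across several <w:t> elements,
        // so concatenate them without a separator.
        foreach ($p->getElementsByTagNameNS($ns, 't') as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}
```

Text inside nested paragraphs (text boxes, for instance) may be collected twice by this sketch, which should be harmless for keyword indexing but would need handling if the exact text matters.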

ZoFreX
-
Thanks for the explanation about docx. catdoc does not work with docx files, so I'm using a combination of catdoc and a little bash script that does what you described, found here: http://stackoverflow.com/questions/1184747/rtf-doc-docx-text-extraction-in-program-written-in-c-qt – Ultimate Gobblement Nov 25 '10 at 12:03
-
You might want to be wary of using such simple scripts to parse XML... parsing XML is very easy, but doing it with a bash script or regex could cause headaches if (for whatever reason) there's weird stuff floating around in those w:t tags. – ZoFreX Nov 26 '10 at 09:57
Here's a good example using catdoc:
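function catdoc_string($str)
{
    // Requires the catdoc command-line tool.
    // Write the document contents to a temporary file.
    $tmpfname = tempnam('/tmp', 'doc');
    $handle = fopen($tmpfname, 'w');
    fwrite($handle, $str);
    fclose($handle);
    // Run catdoc, capturing stderr along with stdout.
    $ret = shell_exec('catdoc -ab ' . escapeshellarg($tmpfname) . ' 2>&1');
    // Remove the temp file.
    unlink($tmpfname);
    // If catdoc is not installed, the shell's error message comes back instead of text.
    if (preg_match('/^sh: line 1: catdoc/i', (string) $ret)) {
        return false;
    }
    return trim((string) $ret);
}

function catdoc_file($fname)
{
    // Requires the catdoc command-line tool.
    // Run catdoc directly on an existing file.
    $ret = shell_exec('catdoc -ab ' . escapeshellarg($fname) . ' 2>&1');
    // If catdoc is not installed, the shell's error message comes back instead of text.
    if (preg_match('/^sh: line 1: catdoc/i', (string) $ret)) {
        return false;
    }
    return trim((string) $ret);
}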
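
Since catdoc only handles the old binary .doc format (as noted in the comments on the other answer), a caller may want to check which kind of file was uploaded before deciding how to extract it. Below is a small sketch that guesses the format from the first bytes of the file. The helper name word_format is invented for this example, and the check is only a heuristic: the OLE2 signature is shared by other old Office files such as .xls, and the zip signature matches any zip archive.

```php
<?php
// Hypothetical helper: returns 'doc', 'docx', or false if the file looks like neither.
function word_format(string $path)
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $magic = (string) fread($handle, 8);
    fclose($handle);

    // OLE2 compound file signature, used by the old binary .doc format.
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'doc';
    }
    // Zip local file header signature; .docx files are zip containers.
    if (strncmp($magic, "PK\x03\x04", 4) === 0) {
        return 'docx';
    }
    return false;
}
```

A 'doc' result could then be routed to catdoc and a 'docx' result to a zip/XML based extractor, with uploads that match neither rejected.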

Ruel
-
Cool, that looks as if it should do the trick. I'll look into it. Thanks. – Ultimate Gobblement Nov 24 '10 at 10:58