
Using Python 3.5.

I need to find specific text stored in old-style, 1997-2003 Windows .doc files and dump it into a CSV. My constraints are:

a) doc files are in a zipped archive: I can't write to disk/I need to work in memory

b) I need to find specific text with regex, so the .doc files need to be converted to .txt

Ideally I could read the files with zipfile, pass the data on to some doc-to-txt converter (e.g. textract), and run my regex on the txt. This might look like:

import zipfile
import textract
import re

with zipfile.ZipFile(zip_archive, 'r') as f:
    for name in f.namelist():
        data = f.read(name)
        txt = textract.process(data).decode('utf-8')
        # some regex on txt

This of course doesn't work, because the argument for textract (and any other doc-to-txt converter) is a filepath, while "data" is bytes. Using "name" as the argument gives a MissingFileError, probably because zip archives don't have directory structures, just filenames simulating paths.

Is there any way to regex through zipped doc files only in memory, without extracting the files (and therefore writing them to disk)?

  • After skimming the code of docx2txt I think it may be possible to give a file-like object to its "process" function as file. If so you just have to wrap the "data" in a "BytesIO" object. – Michael Butscher Mar 19 '20 at 01:16
  • What exactly is preventing you from writing to disk? – Karl Knechtel Mar 19 '20 at 02:16
  • @MichaelButscher in that case you can probably use [`ZipFile.open`](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.open) in order to get a file-like object straight from the zip file. – Masklinn Mar 19 '20 at 10:22
  • @KarlKnechtel this is all for reproducible academic research (I'm extracting info from employment rolls of Romanian judiciary) so I wanted code that anyone can plug into python and run, without worrying what permissions they have, etc. I've run into problems extracting thousands of files then running through them before, especially on machines that are synced. – Radu Parvulescu Mar 19 '20 at 22:27
  • @RaduParvulescu, alright. I wrote up several options for you. I like the RAM drive idea the best personally. Also, the idea of mounting the zip file to the FS seems interesting. Can we update the wording of the question title? – Todd Mar 20 '20 at 20:39
  • @RaduParvulescu, if the problem is portability rather than a hard restriction against writing to disk, how about a workflow that writes to disk _one_ file from zip, processes it, and deletes it before extracting the next one? It would make your life a lot easier. (And would not blow up your disk with large archives that might expand 10 x or more when unzipped.) – alexis Mar 21 '20 at 02:41

1 Answer


Working with files without writing to a physical drive

In most cases, the files within a zip have to be extracted before they can be processed, but the extraction can happen in memory. The roadblock is that the utility doing the text extraction accepts only a filesystem path as its argument, so the zipped files need to be reachable through a path without being written to the physical drive.

Internally textract invokes a command line utility (antiword) that does the actual text extraction. So the approach that solves this could be applied generally to other command line tools that need access to zip contents via a filesystem path.

Below are several possible solutions to get around this restriction on files:

  1. Mount a RAM Drive.
    • This works well, but requires a sudo prompt; that prompt can be automated, though.
  2. Mount the zip file to the filesystem. (good option)
    • A good Linux tool for mounting these is fuse-zip.
  3. Use the tempfile module. (easiest)
    • Ensures files are automatically deleted.
    • Drawback: files may be written to disk.
  4. Access the XML within the .docx files.
    • Can regex through the raw XML, or use an XML reader.
    • Only a small portion of your files are .docx though.
  5. Find another extractor. (not covered)
    • I looked and couldn't find anything.
    • docx2txt is another Python module, but it looks like it will only handle .docx files (as its name implies) and not old Word .doc files.

Why did I do all this legwork, you may wonder? I actually found this useful for one of my own projects.


1) RAM Drive

If tempfile doesn't satisfy the no-disk constraint and you want to ensure all files used by the tool stay in RAM, creating a RAM drive is a great option. The tool should unmount the drive when it's done, which deletes all the files it stored.

A plus with this option is that Linux systems support it natively, so it incurs no additional software dependencies there; Windows will probably require ImDisk.

These are the relevant bash commands on Linux:

$ mkdir ./temp_drive
$ sudo mount -t tmpfs -o size=512m temp_drive ./temp_drive
$ 
$ mount | tail -n 1     # To see that it was mounted.
$ sudo umount ./temp_drive   # To unmount.

On MacOS:

$ diskutil erasevolume HFS+ 'RAM Disk' `hdiutil attach -nomount ram://1048576 `
$ # 512M drive created: 512 * 2048 == 1048576

On Windows, you may have to use a 3rd party application like ImDisk (linked at the end).

To automate the process, this short script prompts the user for their sudo password, then invokes mount to create a RAM drive:

import subprocess as sp
import tempfile
import platform
import getpass

ramdrv = tempfile.TemporaryDirectory()

if platform.system() == 'Linux':

    sudo_pw = getpass.getpass("Enter sudo password: ")

    # Mount RAM drive on Linux.
    p = sp.Popen(['sudo', '-S', 'bash', '-c', 
                 f"mount -t tmpfs -o size=512m tmpfs {ramdrv.name}"], 
                 stderr=sp.STDOUT, stdout=sp.PIPE, stdin=sp.PIPE, bufsize=1,
                 encoding='utf-8')

    print(sudo_pw, file=p.stdin)

    del sudo_pw

    print(p.stdout.readline())

elif platform.system() == 'Darwin':
    pass  # And so on for macOS, Windows...

Whatever GUI package your application uses likely has a password dialog, but getpass works well for console applications.

To access the RAM drive, use the folder it's mounted on like any other folder in the system: write files to it, read files from it, create subfolders, etc.
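
For example, here's a minimal sketch of feeding textract through the RAM drive. It assumes the ramdrv mount from the script above; the archive and member names are hypothetical:

import os
import zipfile
import textract

with zipfile.ZipFile('archive.zip') as zf:        # hypothetical archive
    zf.extract('report.doc', path=ramdrv.name)    # extracted into RAM, not to disk
    text = textract.process(os.path.join(ramdrv.name, 'report.doc'))
    # ...regex on text...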


2) Mount the Zip file

If the Zip file can be mounted on the OS file system, then its files will have paths that can be passed to textract. This could be the best option.

For Linux, a utility that works well is fuse-zip. The few lines below install it, and mount a zip file.

$ sudo apt-get install fuse-zip
...
$ mkdir ~/archivedrive
$
$ fuse-zip ~/myarchive.zip ~/archivedrive
$ cd ~/archivedrive/myarchive           # I'm inside the zip!

From Python: create a temporary mount point, mount the zip, extract the text, then unmount the zip:

>>> import subprocess as sp, tempfile, textract
>>>
>>> zf_path = '/home/me/marine_life.zip'
>>> zipdisk = tempfile.TemporaryDirectory()           # Temp mount point.
>>> 
>>> cp = sp.run(['fuse-zip', zf_path, zipdisk.name])  # Mount.
>>> cp.returncode
0
>>> all_text = textract.process(f"{zipdisk.name}/marine_life/octopus.doc")
>>> 
>>> cp = sp.run(['fusermount', '-u', zipdisk.name])   # Unmount.
>>> cp.returncode
0
>>> del zipdisk                                       # Delete mount point.
>>> all_text[:88]
b'The quick Octopuses live in every ocean, and different species have\n
adapted to different'
>>>
>>> # Convert bytes to str if needed.
>>> as_string = all_text.decode('latin-1', errors='replace')

A big plus with this approach is that it doesn't require sudo to mount the archive - no prompting for a password. The only drawback is that it adds a dependency to the project, which is probably not a major concern. Automating the mounting and unmounting is easy with subprocess.run(), as shown above.

I believe that the default configuration for Linux distros allows users to mount Fuse filesystems without the need to use sudo; but that would need to be verified for the supported targets.

For Windows, ImDisk can also mount archives and has a command line interface, so it could possibly be automated to support Windows. The XML approach and this approach are both nice because they get the information directly from the zip file without the additional step of writing it out to a file.

Regarding character encodings: I made the assumption in the example that old Eastern European Word documents that predate 2006 might use some encoding other than 'utf-8' (iso-8859-2, latin-1, windows-1250, cyrillic, etc.). You might have to experiment a bit to ensure that each of the files is converted to strings correctly.
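
If the encoding isn't known up front, one way to experiment is a best-effort decoder. This is a sketch, not part of the original workflow, and the candidate list is a guess for documents of that era and region:

def decode_best_effort(raw):
    """Return `raw` decoded with the first candidate that doesn't raise.
    latin-1 maps every byte, so it always succeeds as the final fallback.
    A clean decode isn't proof of the *right* encoding - spot-check results."""
    for enc in ('utf-8', 'windows-1250', 'iso-8859-2', 'latin-1'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue

as_string = decode_best_effort(all_text)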



3) tempfile.NamedTemporaryFile

This approach doesn't require any special permissions. It should just work. However, the files it creates aren't guaranteed to be in memory only.

If the concern is that your tool will overpopulate the users' drives with files, this approach would prevent that. The temp files are reliably deleted automatically.

Here's some sample code that creates a NamedTemporaryFile, opens a zip and extracts a file into it, then passes its path to textract:

>>> import zipfile, tempfile, textract
>>>
>>> zf = zipfile.ZipFile('/temp/example.docx')
>>> wf = zf.open('word/document.xml')
>>> tf = tempfile.NamedTemporaryFile()
>>>
>>> for line in wf:
...     tf.file.write(line)
>>>
>>> tf.file.seek(0)   # Seeking also flushes the write buffer to the file.
0
>>> textract.process(tf.name)

# Lines and lines of text dumped to screen - it worked!

>>> tf.close()
>>>
>>> # The file disappears.

You can reuse the same NamedTemporaryFile object over and over: tf.seek(0) resets its position, and tf.truncate() discards leftover bytes if the next file is shorter. A sketch of that pattern follows.
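
Putting that together with a zip loop (a sketch; the archive name is hypothetical, and the .doc suffix is an assumption so textract can pick its parser from the filename):

import zipfile
import tempfile
import textract

with zipfile.ZipFile('archive.zip') as zf:            # hypothetical archive
    tf = tempfile.NamedTemporaryFile(suffix='.doc')   # suffix guides textract
    for name in zf.namelist():
        tf.seek(0)
        tf.truncate()               # discard the previous member's bytes
        tf.write(zf.read(name))
        tf.flush()                  # make sure textract sees all the data
        text = textract.process(tf.name)
        # ...regex on text...
    tf.close()                      # the temp file vanishes here

Note that opening tf.name from a second process works on Unix; on Windows the file can't be reopened while it's still open here.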

Don't close the file until you're done with it: it vanishes when you close it. Instances of NamedTemporaryFile are automatically deleted when they're closed, when their refcount drops to 0, or when your program exits.

If you want a temporary folder that's guaranteed to disappear after your program is done, tempfile.TemporaryDirectory is an option; see the sketch below.
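
A minimal sketch of that pattern (the archive name is hypothetical): extract everything into the temporary directory, process each file, and let the context manager clean up:

import os
import tempfile
import zipfile
import textract

with tempfile.TemporaryDirectory() as tmpdir:
    with zipfile.ZipFile('archive.zip') as zf:     # hypothetical archive
        zf.extractall(tmpdir)
        for name in zf.namelist():
            path = os.path.join(tmpdir, name)
            if not os.path.isfile(path):           # skip directory entries
                continue
            text = textract.process(path)
            # ...regex on text...
# tmpdir and everything in it is deleted at this point.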

The same module offers tempfile.SpooledTemporaryFile, a file that lives in memory until it grows past a threshold. While it's spooled in memory it has no filesystem path at all, and a path is exactly what textract requires.

textract does its extraction in a separate process, but because a NamedTemporaryFile has a real name on the filesystem, that process can open the temp file by its path. That's what makes it possible to share these temp files between the two.


4) Word.docx text extraction via XML

This approach attempts to remove the need for the 3rd party utility by doing the work within Python, or using another tool that doesn't require FS paths.

The .docx files within the zip files are also zip files containing XML. XML is text and it can be parsed raw with regular expressions, or passed to an XML reader first.

The Python module docx2txt does pretty much the same thing as the 2nd example below. I looked at its sources: it opens the Word document as a zip and uses an XML parser to get the text nodes. So it shares this approach's limitation - it handles only .docx, not the old .doc format.

The two examples below read document.xml directly out of the .docx archive - the file isn't extracted to disk. And since zipfile.ZipFile accepts any file-like object, a .docx that itself sits inside the outer zip can be opened entirely in memory first (see the sketch that follows).
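
A minimal sketch of that nested, in-memory read, building on the BytesIO idea from the comments (the archive and member names are hypothetical):

import io
import zipfile

with zipfile.ZipFile('archive.zip') as outer:                  # hypothetical
    docx_bytes = io.BytesIO(outer.read('docs/example.docx'))   # hypothetical member
    with zipfile.ZipFile(docx_bytes) as docx:
        xml_data = docx.read('word/document.xml')              # raw XML bytes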

If you want to convert the raw XML text to a dictionary and lists, you can use xmltodict:

import zipfile
import xmltodict

zf        = zipfile.ZipFile('/temp/example.docx')
data      = xmltodict.parse(zf.open('word/document.xml'))
some_text = data['w:document']['w:body']['w:p'][46]['w:r']['w:t']

print(some_text)

I found this format a bit unwieldy because of the complicated nesting structure of the XML elements, and it doesn't give you the advantages an XML reader does as far as locating nodes.

Using xml.etree.ElementTree, an XPATH expression can extract all the text nodes in one shot.

import re
import xml.etree.ElementTree as ET
import zipfile

_NS_DICT = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_docx_text(docx_path):
    """
    Opens the .docx file at 'docx_path', parses its internal document.xml
    document, then returns its text as one (possibly large) string.
    """
    with zipfile.ZipFile(docx_path) as zf:
        tree = ET.parse(zf.open('word/document.xml'))
    all_text = '\n'.join(n.text or '' for n in tree.findall('.//w:t', _NS_DICT))
    return all_text

Using the xml.etree.ElementTree module as above makes text extraction possible in only a few lines of code.

In get_docx_text(), this line grabs all the text:

all_text = '\n'.join(n.text or '' for n in tree.findall('.//w:t', _NS_DICT))

The string './/w:t' is an XPath expression that tells the module to select all the t (text) nodes of the Word document; the join() then concatenates their text, one node per line. (The `or ''` guards against empty <w:t/> elements, whose .text is None.)

Once you have the text returned from get_docx_text(), you can apply your regular expressions, iterate over it line by line, or whatever you need to do. For example, a regex can grab all parenthetical phrases, as sketched below.
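
A sketch of that regex step, using the re module imported above (the pattern is illustrative):

# Grab every parenthetical phrase from the extracted text.
text = get_docx_text('/temp/example.docx')
parentheticals = re.findall(r'\(([^)]*)\)', text)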


Links

The Fuse filesystem: https://github.com/libfuse/libfuse

fuse-zip man page: https://linux.die.net/man/1/fuse-zip

MacOS Fuse: https://osxfuse.github.io/

ImDisk (Windows): http://www.ltr-data.se/opencode.html/#ImDisk

List of RAM drive software: https://en.wikipedia.org/wiki/List_of_RAM_drive_software

MS docx file format: https://wiki.fileformat.com/word-processing/docx/

The xml.etree.ElementTree doc: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xml%20etree#module-xml.etree.ElementTree

XPATH: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xml%20etree#elementtree-xpath

The XML example borrowed some ideas from: https://etienned.github.io/posts/extract-text-from-word-docx-simply/

  • thank you @Todd, your second option is very elegant. Unfortunately when I ran it I got the error "xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2", so probably something wrong with docx formatting. Since only ~30 of the ~4000 files were docx (the rest were 1997-2003 .doc) this was not a mystery worth solving: I just converted them to old-style .doc. This leaves the problem of how to read out text from old-style docs (my fault I didn't clarify that I needed to do this, only my second question on stack) without going on disk. – Radu Parvulescu Mar 19 '20 at 22:30
  • mind=blown. Sorry for delayed response, have a 3 week-old baby... Tried option 3, looks best for cross-platform. It errors out with `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte`. No utf-8 problems when working on disk. One hour later and https://stackoverflow.com/questions/9845842/bytes-in-a-unicode-python-string/9846246#9846246 is probably right that something's garbling up the unicode. Maybe zipfile.open and/or writing to tempfile are adding a little something that screws up unicode decoding afterwards? – Radu Parvulescu Mar 24 '20 at 02:23
  • @RaduParvulescu, try decoding your string data first before passing it to any other object. `all_text.decode('utf-8', errors='replace')` Use `errors='replace'` or `errors='ignore'` Also make sure your files are actually in utf-8 and not latin-1 or iso-8859-2, or something else. – Todd Mar 25 '20 at 08:33
  • @RaduParvulescu, were you able to find a solution to your file processing? I think the data corruption problem was due to the wrong encoding being applied, and opening files in the wrong mode. Let me know if you have any questions on that. – Todd Apr 03 '20 at 19:02
  • please forgive the multi-month delay on my answer accept, it was barbaric. Option three worked, I got around the encoding problem by making a temp directory as you suggested, extracting the zip archive there, taking what I needed from my files, then closing context and poof goes the tempdir. Just basic `tempfile.TemporaryDirectory` functionality. Again, thanks a lot – Radu Parvulescu Jul 31 '20 at 02:20
  • I'm glad you were able to overcome the barbarism and get this to work =) – Todd Jul 31 '20 at 09:51