I am processing a large collection of txt files that are themselves containers for the files I actually want to process. SGML tags in the txt files mark the boundaries of the individual contained files. Sometimes a contained file is a binary that has been uuencoded. I have solved the problem of decoding the uuencoded files, but while mulling over my solution I realized that it is not general enough. Specifically, I have been using

if '\nbegin 644 ' in document['document']:

to test whether the file is uuencoded. After some searching I have a vague understanding of what the 644 means (file permissions), and I have since found other examples of uuencoded files that would instead need

if '\nbegin 642 ' in document['document']:

or some other variant. So my problem is: how do I make sure that I identify all of the subcontainers that hold uuencoded files?

One solution is to test every subcontainer:

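import codecs

uudecode = codecs.getdecoder("uu")

for document in documents:
    try:
        # The decoder returns a (decoded_data, length_consumed) tuple.
        decoded_document, m = uudecode(document)
    except ValueError:
        # The input could not be decoded as uuencoded data.
        decoded_document = ''
    if len(decoded_document) == 0:
        pass  # more stuff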

This is not horrible, since CPU cycles are cheap, but I am going to be handling some 8 million documents.

Thus, is there a more robust way to recognize whether or not a particular string is the result of uuencoding?

PyNEwbie

2 Answers

Wikipedia says that every uuencoded file begins with this line:

begin <perm> <name>

So probably a line matching the regexp ^begin [0-7]{3} (.*)$ denotes the beginning reliably enough.
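
As a minimal sketch of this check in Python (the name looks_uuencoded is hypothetical and not from the question), the pattern can be compiled once with re.MULTILINE so that ^ and $ match at every line rather than only at the start and end of the string. That way any text before the header, such as SGML tags, does not hide it:

```python
import re

# The regexp from the answer, compiled once. re.MULTILINE makes ^ and $
# match at the start and end of every line, not just of the whole string.
UU_BEGIN = re.compile(r"^begin [0-7]{3} (.*)$", re.MULTILINE)


def looks_uuencoded(text):
    """Return True if text contains a plausible uuencode 'begin' line.

    Hypothetical helper: it only checks for the header line and does not
    verify that the body actually decodes.
    """
    return UU_BEGIN.search(text) is not None
```

Because the mode is matched as any three octal digits, this covers begin 644, begin 642, and other permission values alike. A match only shows that a header line is present; it does not prove that the body decodes cleanly.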

9000
  • I appreciate this thought. I am concerned, though, that I am not gaining much over simply trying to decode the file, since I still have to run the regexp. – PyNEwbie Jan 11 '11 at 21:47
  • A compiled regexp matches (or fails) very quickly. Maybe uudecode fails just as fast and already includes this very step. The only way to find out is to actually try both on 2-3 thousand files and measure which is faster. – 9000 Jan 11 '11 at 22:00
  • Note that the file doesn't have to begin with 'begin' - most modern uudecodes will ignore anything up to the first begin - this was probably so you could pipe mail straight into it and not have to filter out headers. – Spacedman Jan 12 '11 at 15:01
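
The measurement suggested in these comments could be sketched roughly as follows. This is only an illustration under several assumptions: it targets Python 3, where the "uu" codec works on bytes; the function names decode_always and decode_if_header are hypothetical; and the sample documents are synthetic stand-ins for the real files, so the timings say nothing definite about the actual 8 million documents.

```python
import codecs
import re
import timeit

# Header check on bytes, searched on every line so that a preamble
# before the 'begin' line does not prevent a match.
UU_BEGIN = re.compile(rb"^begin [0-7]{3} ", re.MULTILINE)
uudecode = codecs.getdecoder("uu")


def decode_always(doc):
    """Attempt to decode every document; return b'' on failure."""
    try:
        return uudecode(doc)[0]
    except ValueError:
        return b""


def decode_if_header(doc):
    """Run the decoder only when a 'begin' header line is present."""
    if UU_BEGIN.search(doc) is None:
        return b""
    return decode_always(doc)


# Synthetic sample data (an assumption, not taken from the question):
# one uuencoded payload with a line of text before it, one plain document.
encoded = codecs.encode(b"binary payload " * 50, "uu")
plain = b"<TEXT>\nordinary prose with no header line\n" * 50
sample = [b"mail header\n" + encoded, plain] * 200

if __name__ == "__main__":
    for fn in (decode_always, decode_if_header):
        seconds = timeit.timeit(lambda: [fn(d) for d in sample], number=5)
        print(fn.__name__, round(seconds, 3))
```

Running both functions over a few thousand real documents, rather than this synthetic sample, would be the meaningful comparison.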
Two ways:

(1) On Unix-based systems, you can robustly use the file command.

http://unixhelp.ed.ac.uk/CGI/man-cgi?file

$ file foo
foo: uuencoded or xxencoded text

(2) I also found the following (untested) Python code that looks like it will do what you want (at http://ubuntuforums.org/archive/index.php/t-1304548.html).

#!/usr/bin/env python
import magic
import sys
filename=sys.argv[1]
ms = magic.open(magic.MAGIC_NONE)
ms.load()
ftype = ms.file(filename)
print(ftype)
ms.close()
EmeryBerger
  • Except for files with content before the 'begin' - which most uudecoders will skip. 'file' will probably report these as ASCII text. Never mind you being on Windows, get Cygwin and then you can have all the Unix goodies. – Spacedman Jan 12 '11 at 15:02