
I regularly receive emails with attachments that I must extract and save to disk. I do essentially the following (in Python 2.7):

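import email.header  # also binds the top-level email package
import sys

message = email.message_from_file(sys.stdin)
for part in message.walk():
    if part.get_filename() is None:
        continue  # container and body parts carry no attachment
    path = email.header.decode_header(part.get_filename())[0][0]
    content = part.get_payload(decode=True)
    with open(path, 'wb') as f:  # binary mode: the payload is raw bytes
        f.write(content)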

This approach has worked for all types of attachments and all flavors of Content-Transfer-Encoding that I've received so far except when the attachment is a ZIP file and the Content-Transfer-Encoding is 'quoted-printable'. In those cases the ZIP file that gets written has one fewer byte (around 60-80% of the way through the file) than the original, and unzip reports errors like:

% unzip -l foo.zip
Archive:  foo.zip
error [foo.zip]:  missing 1 bytes in zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
   440228  01-00-1980 00:00   foo - bar.csv
---------                     -------
   440228                     1 file

and

% unzip foo.zip 
Archive:  foo.zip
error [foo.zip]:  missing 1 bytes in zipfile
  (attempting to process anyway)
error [foo.zip]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
  (attempting to re-compensate)
  inflating: foo - bar.csv   bad CRC 4c86de66  (should be a53f73b1)

The extracted file then differs in size by about 0.01% from the original CSV, and roughly the final 20-40% of the file is garbled.


Now, the code handles ZIP files attached as 'base64' just fine, and it handles other content (Excel files, CSV files) attached as 'quoted-printable' just fine. I know that the ZIP attachment itself arrives intact, because my regular email reader can save it to disk and extract the original content flawlessly. (Is it possible that real email readers perform some error correction when saving the attachment that my Python code does not?)

Is there a known issue with Python being unable to decode ZIP files sent as quoted-printable? Are there other functions in Python's email package that I can try in order to decode this content correctly?


1 Answer


The problem in this case is that the sender's binary attachments (ZIP files) contain \r\n sequences that the sender's mailer does not encode. That is, the ZIP-formatted file itself (not the file being zipped) contains occasional CRLF pairs. Compressed data is effectively arbitrary bytes, so a CR immediately followed by an LF can turn up by chance in the output of any zipper; the fault lies in how those bytes are encoded for mail, not in the ZIP file.

According to Rule #4 of quoted-printable encoding, line breaks in the original "text" (in this case the ZIP attachment) must be represented as bare \r\n in the encoding (and then interpreted however the decoder's locale dictates). Clearly this is seriously bad when the exact form of the line break has meaning (such as when it itself is an encoding). And the RFC even comments about the weirdness of binary data containing literal line breaks:

Since the canonical representation of types other than text do not generally include the representation of line breaks, no hard line breaks (i.e. line breaks that are intended to be meaningful and to be displayed to the user) should occur in the quoted-printable encoding of such types.

So there's a giant warning at the end of the RFC:

WARNING TO IMPLEMENTORS: If binary data are encoded in quoted-printable, care must be taken to encode CR and LF characters as "=0D" and "=0A", respectively. In particular, a CRLF sequence in binary data should be encoded as "=0D=0A". Otherwise, if CRLF were represented as a hard line break, it might be incorrectly decoded on platforms with different line break conventions.
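
The effect described in the warning can be reproduced with the standard library. The following is only a sketch under stated assumptions: the byte string `data` is a made-up stand-in for a ZIP file that happens to contain a CRLF pair, and `hard_break` is a hand-written guess at what a non-compliant encoder's output looks like after a gateway has normalized the line break to LF. It is written to run on Python 3 as well as 2.7.

```python
import binascii
import quopri

# Hypothetical stand-in for a ZIP file whose bytes happen to include CR LF.
data = b'PK\x03\x04\r\nrest'

# istext=False tells the encoder to treat CR and LF as data, not as line breaks,
# so they are emitted in =XX form as the warning requires.
safe = binascii.b2a_qp(data, istext=False)
round_tripped = quopri.decodestring(safe)

# Assumed shape of a non-compliant encoding after the CRLF was flattened to LF
# in transit: the line break is now a bare, unencoded newline.
hard_break = b'PK=03=04\nrest'
damaged = quopri.decodestring(hard_break)

print(repr(safe))
print(len(data), len(round_tripped), len(damaged))
```

The compliant encoding survives the round trip, while the hard line break decodes to a single LF, which is where the missing byte comes from.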

The sender is clearly not obeying this warning when encoding, and so some mail transfer agent or gateway between the sender and me is deciding that an appropriate line break for my locale is simply \n (which it normally is).

Anyhow, I discovered this was the problem by comparing my quopri-decoded attachment byte by byte against an original copy of the attached ZIP file. The two were identical except that every CRLF in the original was simply a LF in my decode. Because the \r is clearly meaningful, and because every other newline in the QP-encoding is correctly prefaced by a line-wrapping = character, I've simply written the following transform for all QP-encoded application MIME types from this sender:

import quopri
import re

if (part.get('Content-Disposition', '').startswith('attachment') and
        part.get('Content-Transfer-Encoding', '').lower() == 'quoted-printable'):
    rawContent = part.get_payload(decode=False)
    # Re-encode every hard newline (one not preceded by a soft-break '=') as CRLF.
    fixedRawContent = re.sub(r'(?<!=)\n', '=0D=0A=\n', rawContent)
    decodedContent = quopri.decodestring(fixedRawContent)

By turning every hard (unexpected) newline into an encoded \r\n (followed by my own soft newline just so I don't have to worry about creating any overlong lines), the decode function dutifully places said \r\n into the ZIP, which then extracts correctly.
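
As a self-contained illustration of the repair and of the byte-by-byte check, here is a minimal sketch on synthetic data. The names `repair_hard_newlines` and `first_difference`, and the sample byte strings, are hypothetical and chosen only for this example. The pattern uses a negative lookbehind, `(?<!=)\n`, to find newlines not preceded by `=`. It assumes, as observed for this sender, that every hard newline in the payload was originally a CRLF; for other senders that assumption may not hold.

```python
import quopri
import re


def repair_hard_newlines(raw):
    """Rewrite each newline not preceded by '=' as an encoded CRLF plus a soft break."""
    return re.sub(br'(?<!=)\n', b'=0D=0A=\n', raw)


def first_difference(expected, actual):
    """Return the offset of the first differing byte, or None if the inputs are equal."""
    pairs = zip(bytearray(expected), bytearray(actual))
    for offset, (x, y) in enumerate(pairs):
        if x != y:
            return offset
    if len(expected) != len(actual):
        # One input is a prefix of the other; they diverge where the shorter ends.
        return min(len(expected), len(actual))
    return None


# Hypothetical original bytes, and the payload as it arrived (CRLF flattened to LF).
original = b'PK\x03\x04\r\nrestmore'
received = b'PK=03=04\nrest=\nmore'

naive = quopri.decodestring(received)
repaired = quopri.decodestring(repair_hard_newlines(received))

print(first_difference(original, naive))
print(first_difference(original, repaired))
```

The soft line break after `rest` is left alone because it is already preceded by `=`, so only the hard newline is rewritten.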
