Python 3 - if a string contains only ASCII, is it equal to the string as bytes?

Question

Consider Python 3 SMTPD - the data received is contained in a string. http://docs.python.org/3.4/library/smtpd.html quote: "and data is a string containing the contents of the e-mail"

Facts (correct?):

Strings in Python 3 are Unicode.
Emails are always ASCII.
Pure ASCII is valid Unicode.

Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?

Thus my question, if I decode the SMTPD DATA string to ASCII, or convert the DATA string to bytes, is this equivalent to the bytes of the actual email message that arrived via SMTP?

The context (and perhaps a better question) is: "How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?" My concern is that when DATA goes through a string-to-bytes conversion, the result will somehow differ from the original bytes that arrived via SMTP.

EDIT: it seems the Python developers think SMTPD should be returning binary data anyway. Doesn't seem to have been fixed... http://bugs.python.org/issue19662

If your goal is to record the exact bytes to find malformed emails, you're probably out of luck. The interface doesn't seem to provide that. You can't make the assumption that malformed content would be ASCII even if the RFC mandates that it should be. — Mark Ransom, Feb 06 '14 at 23:32
Hi Mark - my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form that they arrived. — Duke Dougal, Feb 06 '14 at 23:39

score 4 · Accepted Answer · edited Oct 07 '21 at 05:57

if a string contains only ASCII, is it equal to the string as bytes?

No. It is not equal in Python 3:

>>> '1' == b'1'
False

A bytes object is not equal to a str (Unicode string) object, in much the same way that an integer is not equal to a string:

>>> '1' == 1
False

In some programming languages the above comparisons are true e.g., in Python 2:

>>> b'1' == u'1'
True

and 1 == '1' in Perl:

$ perl -e "print qq(True\n) if 1 == q(1)"
True

Your question is a good example of why the stricter Python 3 behaviour is preferable. It forces programmers to confront their text/bytes misconceptions without waiting for their code to break for some input.
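To compare text with bytes in Python 3, one side has to be converted explicitly with a named encoding. A minimal sketch (the variable names are illustrative only):

```python
text = '1'
raw = b'1'

# A str and a bytes object never compare equal in Python 3.
mixed_equal = (text == raw)

# Choose an encoding explicitly and compare like with like.
as_bytes_equal = (text.encode('ascii') == raw)
as_text_equal = (text == raw.decode('ascii'))

print(mixed_equal, as_bytes_equal, as_text_equal)  # False True True
```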

Strings in Python 3 are Unicode.

Yes. Strings are immutable sequences of Unicode code points in Python 3.

Emails are always ASCII.

Most emails are transported as 7-bit messages (ASCII range: hex 00-7F). However, "virtually all modern email servers are 8-bit clean", i.e., 8-bit content won't be corrupted, and the 8BITMIME extension sanctions the passing of some 8-bit content.

In other words: emails are not always ASCII.

Pure ASCII is valid Unicode.

ASCII is a character encoding. You can decode some byte sequences to Unicode using the US-ASCII character encoding. Unicode strings have no associated character encoding, i.e., you can encode them into bytes using any character encoding that can represent the corresponding Unicode code points.
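As a small illustration of that last point, the same str can produce different bytes depending on the encoding you pick. The sample strings here are made up for the example:

```python
ascii_only = 'abc'
accented = 'caf\u00e9'  # 'café'

# For ASCII-only text, several common encodings agree on the bytes...
same_bytes = (ascii_only.encode('ascii')
              == ascii_only.encode('utf-8')
              == ascii_only.encode('latin-1'))

# ...but not every encoding does.
utf16_bytes = ascii_only.encode('utf-16-le')

# Outside the ASCII range the encodings diverge.
utf8_bytes = accented.encode('utf-8')
latin1_bytes = accented.encode('latin-1')

print(same_bytes)     # True
print(utf16_bytes)    # b'a\x00b\x00c\x00'
print(utf8_bytes)     # b'caf\xc3\xa9'
print(latin1_bytes)   # b'caf\xe9'
```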

Therefore the email that came in is pure ASCII (which is valid Unicode), therefore the SMTPD DATA string is exactly equivalent to the original bytes received by SMTPD. Is this correct?

If the input bytes are all in the ASCII range, then data.decode('ascii', 'strict').encode('ascii') == data. However, Lib/smtpd.py applies some conversions to the input data (according to RFC 5321), so the content that you get as data may differ from what was sent even if the input is pure ASCII.
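The round-trip claim can be checked directly. This is a minimal sketch; ascii_round_trip is a hypothetical helper name, and the sample byte strings are invented:

```python
def ascii_round_trip(data: bytes) -> bool:
    """Return True if data survives bytes -> str -> bytes via strict ASCII."""
    try:
        return data.decode('ascii', 'strict').encode('ascii') == data
    except UnicodeDecodeError:
        # At least one byte is outside 0x00-0x7F.
        return False

print(ascii_round_trip(b'From: a@example.com\r\n'))  # True
print(ascii_round_trip(b'caf\xc3\xa9'))              # False
```

Note that this only covers the decode/encode step itself, not any other processing applied to the data before you see it.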

"How do I save to a file Python 3's SMTPD DATA as PRECISELY the bytes that were received?"

my goal is not to find malformed emails but to save inbound emails to disk in precisely the binary/bytes form that they arrived.

The bug that you've linked (smtpd.py should not decode utf-8) means that smtpd.py is not 8-bit clean.

You could override the SMTPChannel.collect_incoming_data method from smtpd.py to save the incoming bytes as is.
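A rough sketch of that idea follows. RawCaptureMixin and FakeChannel are hypothetical names: FakeChannel only stands in for smtpd.SMTPChannel so that the example runs on its own (the smtpd module was deprecated and then removed in Python 3.12). With the real module you would presumably combine the mixin with smtpd.SMTPChannel instead. The captured chunks would likely include SMTP command lines as well as message data, so some filtering would still be needed.

```python
class RawCaptureMixin:
    """Hypothetical mixin: keep each chunk before the base class decodes it."""

    def __init__(self, *args, **kwargs):
        self.raw_chunks = []
        super().__init__(*args, **kwargs)

    def collect_incoming_data(self, data):
        self.raw_chunks.append(bytes(data))    # untouched copy of the bytes
        super().collect_incoming_data(data)    # normal processing continues


class FakeChannel:
    """Stand-in for smtpd.SMTPChannel (assumption: it decodes to str)."""

    def __init__(self):
        self.received_lines = []

    def collect_incoming_data(self, data):
        self.received_lines.append(str(data, 'utf-8', 'replace'))


class RawFakeChannel(RawCaptureMixin, FakeChannel):
    pass


channel = RawFakeChannel()
channel.collect_incoming_data(b'caf\xe9 au lait')   # latin-1 bytes, not UTF-8

raw = b''.join(channel.raw_chunks)
print(raw)                      # b'caf\xe9 au lait'
print(channel.received_lines)   # decoded text has lost the original byte
```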

"A string of ASCII text is also valid UTF-8 text."

It is true. It is a nice property of the UTF-8 encoding. If you can decode a byte sequence into Unicode using the US-ASCII character encoding, then you can also decode the bytes using the UTF-8 character encoding (and the resulting Unicode code points are the same in both cases).
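A quick check of this property, using an invented ASCII-only message:

```python
wire = b'Subject: test\r\n\r\nHello, world!\r\n'

via_ascii = wire.decode('ascii')
via_utf8 = wire.decode('utf-8')

# Both decoders yield the same code points for ASCII-only input.
same_text = (via_ascii == via_utf8)

# Encoding the text back with either codec restores the original bytes.
same_bytes = (via_ascii.encode('ascii') == via_ascii.encode('utf-8') == wire)

print(same_text, same_bytes)  # True True
```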

smtpd.py should have used either latin1 (it decodes any byte sequence) or ascii (with the 'strict' error handler, to fail on any non-ASCII byte) instead of utf-8 (it accepts some non-ASCII byte sequences, which is bad).
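The difference between the three codecs can be seen in a short sketch. The helper name decodes is hypothetical:

```python
def decodes(data: bytes, encoding: str) -> bool:
    """Return True if data decodes with the given codec in strict mode."""
    try:
        data.decode(encoding, 'strict')
        return True
    except UnicodeDecodeError:
        return False

every_byte = bytes(range(256))

# latin-1 maps each byte to the code point with the same number, so
# decoding never fails and encoding gives the original bytes back.
latin1_round_trip = every_byte.decode('latin-1').encode('latin-1') == every_byte

ascii_accepts_e9 = decodes(b'\xe9', 'ascii')        # non-ASCII byte
utf8_accepts_pair = decodes(b'\xc3\xa9', 'utf-8')   # valid two-byte sequence
utf8_accepts_lone = decodes(b'\xe9', 'utf-8')       # incomplete sequence

print(latin1_round_trip)   # True
print(ascii_accepts_e9)    # False
print(utf8_accepts_pair)   # True
print(utf8_accepts_lone)   # False
```

So utf-8 sits awkwardly in the middle: it neither accepts everything (like latin1) nor rejects everything outside ASCII (like strict ascii).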

Keep in mind:

some emails may have bytes outside ascii range
de-transparency according to RFC 5321 doesn't preserve input bytes as-is even if they are all in ascii range
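The second point can be illustrated with a simplified sketch of the receiver-side step described in RFC 5321, section 4.5.2: a leading dot on a line of the DATA payload is removed. The function name undo_dot_stuffing is hypothetical, and this is only an approximation of the idea, not the code from smtpd.py.

```python
def undo_dot_stuffing(wire_body: bytes) -> bytes:
    """Strip one leading '.' from each line, as an SMTP receiver would.

    Assumes wire_body is the DATA payload without the final lone '.' line.
    """
    lines = wire_body.split(b'\r\n')
    return b'\r\n'.join(
        line[1:] if line.startswith(b'.') else line
        for line in lines
    )

# The sender doubled the dot at the start of the second line.
wire_body = b'Line one\r\n..starts with a dot\r\nLine three'
delivered = undo_dot_stuffing(wire_body)

print(delivered)                # b'Line one\r\n.starts with a dot\r\nLine three'
print(delivered == wire_body)   # False, although every byte is ASCII
```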

Awesome answer thanks JF - would you mind commenting on this from the Python docs, and how it relates to your answer above thanks: "A string of ASCII text is also valid UTF-8 text." http://docs.python.org/3.3/howto/unicode.html — Duke Dougal, Feb 07 '14 at 05:56
@DukeDougal: I've commented on "A string of ASCII text is also valid UTF-8 text." — jfs, Feb 07 '14 at 06:16
