2

I am migrating my ticket system to pgSQL. I allow for e-mail replies where PHP parses each e-mail into it's components and then message is then stored in a pgSQL table named inbox .

The first e-mail parsed and then saved successfully. There were no errors. Now I am receiving the error message

invalid byte sequence for encoding "UTF8": 0x86

I've confirmed the database is using UTF8 encoding: - SHOW SERVER_ENCODING gives a result of UTF8 - SHOW CLIENT_ENCODING originally wasn't UTF8. I set this to UTF8.

The error persists.

email_queue.php contains various PHP classes and functions to receive and send e-mails. The command "file email_queue.php" gives the result:

email_queue.php: PHP script, UTF-8 Unicode text, with very long lines

email_queue_receive.php uses the classes and functions for receive e-mails. This file includes email_queue.php for the functionality. The command "file email_queue_receive.php" gives the result:

email_queue_receive.php: PHP script, ASCII text

From searches I've done ASCII is a valid UTF8.

I haven't yet found a thread specific to this error as a result of parsing e-mail.

Ron Piggott
  • 705
  • 1
  • 8
  • 26

2 Answers2

3

PostgreSQL is strict on encoding, but the email infrastructure is not. As the doc for PHP's iconv_mime_decode indicates:

ICONV_MIME_DECODE_STRICT If set, the given header is decoded in full conformance with the standards defined in » RFC2047. This option is disabled by default because there are a lot of broken mail user agents that don't follow the specification and don't produce correct MIME headers.

There are also MIME parts in email bodies that violate the character advertised in the Content-Type declaration. An invalid mail will be accepted by SMTP servers as long it can be routed to a recipient, so senders are not made aware of the problem, it's the recipient that has to deal with it.

As a consequence, any part of an email message that has to be inserted into a database text field must be sanitized beforehand. See for example Remove non-utf8 characters from string on how to do it.

Community
  • 1
  • 1
Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
  • I am trying to implement the solution @ http://stackoverflow.com/a/1401716/3123313 . The e-mail that was causing the problem now saves in the database successfully. But now a different e-mail won't process. Now there are more characters following the error message: – Ron Piggott Jul 04 '15 at 22:36
  • ERROR: invalid byte sequence for encoding "UTF8": 0xf6 0xa2 0x95 0xa6 – Ron Piggott Jul 04 '15 at 22:36
  • Will you help me more? If so what exact details to you need? – Ron Piggott Jul 04 '15 at 22:36
  • I am using the PHP pgSQL PDO class for this query. I am unsure if I can display the INSERT INTO the script is trying to perform with the values displayed. – Ron Piggott Jul 04 '15 at 22:41
  • @RonPiggott: what part of the mail fails to insert? – Daniel Vérité Jul 06 '15 at 09:23
2

(Daniel's right, just elaborating):

0x86 can't be the first byte in a utf-8 sequence.

Possible explanations include:

  • The email is not utf-8 encoded
  • The email is utf-8 encoded but the utf-8 in the email is malformed
  • A string is being cut at an invalid byte offset in a utf-8 sequence by non-utf-8-aware substring code
  • your app is mishandling encodings in the MIME parts
  • ...

In general, you're going to have problems inserting email into PostgreSQL because PostgreSQL is very strict about text encoding correctness, wheras mail clients produce and accept all sorts of horrible garbage. You will need to either sanitize the incoming mail (using encoding guessing, stripping suspect parts/chars, etc) or store it in raw byte sequence form as bytea.

I strongly recommending storing as bytea because:

  • One MIME message can contain parts in different encodings
  • MIME parts like email attachments can be sent containing NULL bytes if they don't have a Content-Transfer-Encoding, though most clients won't do so and will base64-encode them. PostgreSQL's text type cannot store null bytes.

Of course, that depends a lot on what you're processing. You might prefer to store as text and discard parts that can't be decoded using their declared text encoding.

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778