57

My PHP script sends email to users, and when the email arrives in their mailboxes, the subject line ($subject) has characters like a^£ added to the end of my subject text. This is obviously an encoding problem. The email message content itself is fine; only the subject line is broken.

I have searched all over but can’t find how to encode my subject properly.

This is my header. Notice that I’m using Content-Type with charset=utf-8 and Content-Transfer-Encoding: 8bit.

//set all necessary headers
$headers = "From: $sender_name<$from>\n";
$headers .= "Reply-To: $sender_name<$from>\n";
$headers .= "X-Sender: $sender_name<$from>\n";
$headers .= "X-Mailer: PHP4\n"; //mailer
$headers .= "X-Priority: 3\n"; //1 UrgentMessage, 3 Normal
$headers .= "MIME-Version: 1.0\n";
$headers .= "X-MSMail-Priority: High\n";
$headers .= "Importance: 3\n";
$headers .= "Date: $date\n";
$headers .= "Delivered-to: $to\n";
$headers .= "Return-Path: $sender_name<$from>\n";
$headers .= "Envelope-from: $sender_name<$from>\n";
$headers .= "Content-Transfer-Encoding: 8bit\n";
$headers .= "Content-Type: text/plain; charset=UTF-8\n";
Palec
daza166
  • Have you thought about using http://phpmailer.worxware.com/? It will save you loads of hassle. – Ashley Dec 08 '10 at 16:24
  • In addition to the provided answers, note that according to [the docs](http://php.net/manual/en/function.mail.php), you are supposed to separate multiple headers with CRLF (`\r\n`), not just LF (`\n`). – Mike Dec 08 '10 at 16:27

3 Answers

85

Update: For a more practical and up-to-date answer, have a look at Palec’s answer.


The character encoding specified in Content-Type describes only the message body, not the headers. For headers, you need to use the encoded-word syntax with either the quoted-printable encoding or the Base64 encoding:

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

You can use imap_8bit for the quoted-printable encoding and base64_encode for the Base64 encoding:

"Subject: =?UTF-8?B?".base64_encode($subject)."?="
"Subject: =?UTF-8?Q?".imap_8bit($subject)."?="
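
As a rough sketch of the Base64 variant for a short subject: the helper name encode_short_subject() is made up for illustration, and the code assumes that $subject is already UTF-8 and short enough for the whole encoded-word to stay within the 75-character limit of RFC 2047.

```php
<?php
// Hypothetical helper, not part of PHP or of the original answer.
// Assumes $subject is UTF-8 and short; long subjects must be split
// into several encoded-words and folded, which this sketch does not do.
function encode_short_subject(string $subject): string
{
    // "B" selects Base64 as the encoding of the encoded-word.
    return '=?UTF-8?B?' . base64_encode($subject) . '?=';
}

$subject = 'Žluťoučký kůň';              // sample text, saved as UTF-8
$encoded = encode_short_subject($subject);

// The encoded value is passed as the subject argument of mail(), e.g.
// mail($to, $encoded, $message, $headers);
```

Note that the Q encoding of RFC 2047 is not identical to plain quoted-printable: spaces and question marks must not appear literally inside the encoded text, so the output of imap_8bit() generally needs further escaping before it forms a valid encoded-word.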
Community
Gumbo
  • Gumbo, I don't understand the difference between base64_encode and imap_8bit. When should I use one or the other? Would it be like this: $subject = '=?UTF-8?B?'.base64_encode($subject).'?=this is the subject'; or do I not need the '?=' where the subject text goes? – daza166 Dec 08 '10 at 16:55
  • @user535256: No, the actual subject needs to be encoded with one of the encodings. Which one you pick is your decision. *Quoted-printable* is considerably more readable, as most of the printable ASCII characters are retained; but it will take more space if your subjects are likely to contain a lot of non-ASCII characters, as each such byte is replaced by a three-byte sequence of the form `=xx`. – Gumbo Dec 08 '10 at 17:01
  • In a project I was working on, there was a problem with characters from the Russian language. Here is a header that contained an invalid UTF-8 character: `Subject: =?utf-8?B?0LjQstGB0YLRg9C/0LjRgtC10LvRjNC90L7QtSDQsdGA0L7Q?= =?utf-8?B?vdC40YDQvtCy0LDQvdC40LUg0L3QvtC80LXRgNGMIDA0OC0xMzktMTMg?= =?utf-8?B?LSBBbWlnb3MgQXBhcnRtZW50IC0g0L/Qu9Cw0YLQtdC2INC/0LXRgNC1?= =?utf-8?B?0YfQuNGB0LvQtdC90LXQvA==?=` – Tomasz Kuter Oct 25 '13 at 12:47
  • I finally fixed the problem by putting each word on a separate line of the message header: `Subject: =?utf-8?B?0LjQstGB0YLRg9C/0LjRgtC10LvRjNC90L7QtQ==?= =?utf-8?B?INCx0YDQvtC90LjRgNC+0LLQsNC90LjQtQ==?= =?utf-8?B?INC90L7QvNC10YDRjA==?= =?utf-8?B?IDA1Mi0xMzktMTM=?= =?utf-8?B?IC0=?= =?utf-8?B?IEFtaWdvcw==?= =?utf-8?B?IEFwYXJ0bWVudA==?= =?utf-8?B?IC0=?= =?utf-8?B?INC/0LvQsNGC0LXQtg==?= =?utf-8?B?INC/0LXRgNC10YfQuNGB0LvQtdC90LXQvA==?=` I hope this will be helpful for someone. I spent 8 hours debugging and fixing that problem. – Tomasz Kuter Oct 25 '13 at 12:49
  • You can also use [quoted_printable_encode()](http://uk.php.net/function.quoted-printable-encode.php) which according to the doc, *is similar to `imap_8bit()`, except this one does not require the IMAP module to work*. – BenMorel Nov 26 '13 at 10:12
  • While the basic idea is OK, this method violates the RFC for longer inputs. It is specified that each encoded word (`=?…?…?…?=`) must be at most 75 chars long and lines containing encoded words must be at most 76 chars long (including the space at the beginning of a continuation line). It is necessary to encode the text into more words and fold the field so that it fits into the limits. – Palec Dec 25 '14 at 13:49
  • Note that because of [RFC6532](https://tools.ietf.org/html/rfc6532), what you did originally should now work with email clients that implement it. However, the RFC is very recent (2012), so I guess very few clients implement it. – Legolas Nov 18 '15 at 15:08
73

TL;DR

$preferences = ['input-charset' => 'UTF-8', 'output-charset' => 'UTF-8'];
$encoded_subject = iconv_mime_encode('Subject', $subject, $preferences);
$encoded_subject = substr($encoded_subject, strlen('Subject: '));
mail($to, $encoded_subject, $message, $headers);

or

mb_internal_encoding('UTF-8');
$encoded_subject = mb_encode_mimeheader($subject, 'UTF-8', 'B', "\r\n", strlen('Subject: '));
mail($to, $encoded_subject, $message, $headers);

Problem and solution

The Content-Type and Content-Transfer-Encoding headers apply only to the body of your message. Headers have their own encoding mechanism, specified in RFC 2047.

You should encode your Subject via iconv_mime_encode(), which exists as of PHP 5:

$preferences = ["input-charset" => "UTF-8", "output-charset" => "UTF-8"];
$encoded_subject = iconv_mime_encode("Subject", $subject, $preferences);

Change input-charset to match the encoding of your string $subject. You should leave output-charset as UTF-8. Before PHP 5.4, use array() instead of [].

Now $encoded_subject is (without trailing newline)

Subject: =?UTF-8?B?VmVyeSBsb25nIHRleHQgY29udGFpbmluZyBzcGVjaWFsIGM=?=
 =?UTF-8?B?aGFyYWN0ZXJzIGxpa2UgxJvFocSNxZnFvsO9w6HDrcOpPD4/PSsqIHA=?=
 =?UTF-8?B?cm9kdWNlcyBzZXZlcmFsIGVuY29kZWQtd29yZHMsIHNwYW5uaW5nIG0=?=
 =?UTF-8?B?dWx0aXBsZSBsaW5lcw==?=

for $subject containing:

Very long text containing special characters like ěščřžýáíé<>?=+* produces several encoded-words, spanning multiple lines

How does it work?

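The iconv_mime_encode() function splits the text, encodes each piece separately into an <encoded-word> token and folds the whitespace between them. An encoded word is =?<charset>?<encoding>?<encoded-text>?= where:

  • <charset> is the character encoding of the text (e.g. UTF-8),
  • <encoding> is B for Base64 or Q for a variant of quoted-printable,
  • <encoded-text> is the text in <charset>, encoded using <encoding>.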

You can decode =?CP1250?B?QWhvaiwgc3bsdGU=?= into UTF-8 string Ahoj, světe (Hello, world in Czech) via iconv("CP1250", "UTF-8", base64_decode("QWhvaiwgc3bsdGU=")) or directly via iconv_mime_decode("=?CP1250?B?QWhvaiwgc3bsdGU=?=", 0, "UTF-8").

Encoding into encoded words is more complicated, because the spec requires each encoded-word token to be at most 75 bytes long and each line containing an encoded-word token to be at most 76 bytes long (including the blank at the start of a continuation line). Don’t implement the encoding yourself. All you really need to know is that iconv_mime_encode() respects the spec.
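
If you want to see the limits for yourself, a small check along these lines may help. It is only a sketch: it assumes the iconv extension is available and the source file is saved as UTF-8, and the variable names are chosen for illustration. The scheme, line-length and line-break-chars preferences are set to their documented defaults just to make them visible.

```php
<?php
// Sample subject taken from the example above.
$subject = 'Very long text containing special characters like ěščřžýáíé<>?=+* '
    . 'produces several encoded-words, spanning multiple lines';

$preferences = [
    'input-charset'    => 'UTF-8',
    'output-charset'   => 'UTF-8',
    'scheme'           => 'B',     // Base64
    'line-length'      => 76,      // maximum length of each header line
    'line-break-chars' => "\r\n",  // separator used when folding
];

$header = iconv_mime_encode('Subject', $subject, $preferences);
if ($header === false) {
    throw new RuntimeException('iconv_mime_encode() failed');
}

// Measure every physical line of the folded header.
$lineLengths = array_map('strlen', explode("\r\n", $header));
$longest = max($lineLengths);

echo $header, "\n";
echo "Longest line: $longest bytes\n";
```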

Interesting related reading is the Wikipedia article Unicode and email.

Alternatives

A rudimentary option is to use only a restricted set of characters. ASCII is guaranteed to work. ISO Latin 1 (ISO-8859-1), as user2250504 suggested, will probably work too, because it is often used as fallback when no encoding is specified. But those character sets are very small and you’ll probably be unable to encode all the characters you’ll want. Moreover, the RFCs say nothing about whether Latin 1 should work or not.

You can also use mb_encode_mimeheader(), as Paul Norman answered, but it’s easy to use it incorrectly.

  1. You must use mb_internal_encoding() to set the mbstring functions’ internally used encoding. The mb_* functions expect input strings to be in this encoding. Beware: The second parameter of mb_encode_mimeheader() has nothing to do with the input string (despite what the manual says). It corresponds to the <charset> in the encoded word (see How does it work? above). The input string is recoded from the internal encoding to this one before being passed to the B or Q encoding.

    Setting the internal encoding might not be needed since PHP 5.6, because the underlying mbstring.internal_encoding configuration option was deprecated in favor of the default_charset option, which has defaulted to UTF-8 since then. Note that this is just a default and it may be inappropriate to rely on defaults in your code.

  2. You must include the header name and colon in the input string. The RFC imposes a strong limit on line length and it must hold for the first line, too! An alternative is to fiddle with the fifth parameter ($indent; last one as of September 2015), but this is even less convenient.

  3. The implementation might have bugs. Even if used correctly, you might get broken output. At least this is what many comments on the manual page say. I have not managed to find any problem, but I know implementation of encoded words is tricky. If you find potential or actual bugs in mb_encode_mimeheader() or iconv_mime_encode(), please, let me know in the comments.

There is also at least one upside to using mb_encode_mimeheader(): it does not always encode all the header contents, which saves space and leaves the text human-readable. The encoding is required only for the non-ASCII parts. The output analogous to the iconv_mime_encode() example above is:

Subject: Very long text containing special characters like
 =?UTF-8?B?xJvFocSNxZnFvsO9w6HDrcOpPD4/PSsqIHByb2R1Y2VzIHNldmVyYWwgZW5j?=
 =?UTF-8?B?b2RlZC13b3Jkcywgc3Bhbm5pbmcgbXVsdGlwbGUgbGluZXM=?=

Usage example of mb_encode_mimeheader():

mb_internal_encoding('UTF-8');
$encoded_subject = mb_encode_mimeheader("Subject: $subject", 'UTF-8');
$encoded_subject = substr($encoded_subject, strlen('Subject: '));
mail($to, $encoded_subject, $message, $headers);

This is an alternative to the snippet in the TL;DR at the top of this post. Instead of just reserving the space for Subject: , it actually puts it there and then removes it, so that the result can be used with mail()’s stupid interface.

If you like mbstring functions better than the iconv ones, you might want to use mb_send_mail(). It uses mail() internally, but encodes subject and body of the message automatically. Again, use with care.

Headers other than Subject need different treatment

Note that you must not assume that encoding the whole contents of a header is OK for all headers that may contain non-ASCII characters. E.g. From, To, Cc, Bcc and Reply-To may contain names for the addresses they contain, but only the names may be encoded, not the addresses. The reason is that <encoded-word> token may replace just <text>, <ctext> and <word> tokens, and only under certain circumstances (see §5 of RFC 2047).
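
To illustrate the point for address headers, here is a minimal sketch that encodes only the display name and leaves the address untouched. The helper format_mailbox() is hypothetical (it is neither a PHP function nor taken from any library), and it assumes a UTF-8 name short enough to fit into a single encoded-word of at most 75 characters.

```php
<?php
// Hypothetical helper for illustration only.
// Assumes $displayName is UTF-8 and short; $address is a plain ASCII address.
function format_mailbox(string $displayName, string $address): string
{
    // Printable ASCII names need no encoded-word; a quoted string suffices.
    if (!preg_match('/[^\x20-\x7E]/', $displayName)) {
        return '"' . addcslashes($displayName, '"\\') . '" <' . $address . '>';
    }

    // Only the name becomes an encoded-word; the address stays as it is.
    $encodedName = '=?UTF-8?B?' . base64_encode($displayName) . '?=';

    return $encodedName . ' <' . $address . '>';
}

$plain   = format_mailbox('John Doe', 'john@example.com');
$encoded = format_mailbox('Jiří Novák', 'jiri@example.com');

// Possible use when building headers:
// $headers = 'From: ' . $encoded . "\r\n";
```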

Encoding of non-ASCII text in other headers is a related but different question. If you wish to know more about this topic, search. If you find no answer, ask another question and point me to it in the comments.

Community
Palec
24

mb_encode_mimeheader() for UTF-8 strings can be useful here, e.g.

$subject = mb_encode_mimeheader($subjectText,"UTF-8");
Andrew Lott
Paul Norman
  • I experienced strange effects when using `mb_encode_mimeheader()`: The `=?UTF-8?B?` prefix was not added to the beginning of my subject string, but somewhere in the middle. So I reverted to building the encoded-word syntax manually as Gumbo has shown. – Jpsy Sep 20 '12 at 21:39
  • @Jpsy That’s fine. It suffices to just encode those words with non-ASCII characters or even just those characters alone. But you have to be aware that [intermediate spaces are getting collapsed](http://stackoverflow.com/a/1294391/53114) which can lead to unexpected results. – Gumbo Dec 17 '12 at 05:56