
We are using JavaMail to send mail with PDF attachments. When the filename contains Unicode characters, the attachment ends up named with the raw UTF-8 encoded form of the name. On inspecting the mail headers, we found that the ? characters in the MIME-encoded filename have been dropped. For example:

Expected:

Content-Disposition: attachment; 
    filename="=?utf8?Q?hinzugef=C3=BCgte.pdf?="

Obtained:

Content-Disposition: attachment; 
    filename="=utf8Qhinzugef=C3=BCgte.pdf="

Because of this, the attachment's filename becomes =utf8Qhinzugef=C3=BCgte.pdf= and we are unable to open it.

If I manually edit the .eml file, add the ? characters in the right places, and open it in Outlook, the file is displayed as a PDF as expected.

The issue has been reported with Exchange Server; we are unable to reproduce it with Gmail or with Fake SMTP (which I use on my machine to test mail).

Sample code:

MimeBodyPart mbp2 = new MimeBodyPart();
String attFileName = file.getName();
String i18nFileName = new String(attFileName.getBytes(), "UTF-8");
String mimeType = mimeMap.getContentType(attFileName);
attStream = new FileInputStream(att);
ByteArrayDataSource bas = new ByteArrayDataSource(attStream, mimeType);
mbp2.setDataHandler(new DataHandler(bas));
mbp2.setFileName(MimeUtility.encodeText(i18nFileName));
mp.addBodyPart(mbp2);
if (attStream != null) {
    attStream.close();
}

Why does this happen? Any leads would be very helpful.

Mark Rotteveel

1 Answer

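This is encoded incorrectly to begin with.

  • What you implemented is RFC 2047, but that doesn't apply to HTTP at all, and in mail it is not meant for parameter values such as filename= either (RFC 2231 covers those, with the same extended notation).
  • RFC 6266 § 4.3 explains how to deal with the filename= parameter for that HTTP header and then refers to
  • RFC 5987, obsoleted by RFC 8187 (see § 3.2.3), on how to incorporate non-ASCII characters.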

The generic form is filename*=UTF-8''Na%C3%AFve%20file.txt and it differs in several aspects from RFC 2047 which you implemented:

  • filename*= should be used - note the trailing asterisk at the parameter. This is to signal extended notation - otherwise neither a charset nor percent encoding is expected.
  • Enclosing the value in "quotation marks" is neither needed, nor allowed when using extended notation.
  • Likewise the prefix =?, the suffix ?=, and the ?Q? encoding parameter are never expected. Logically they would make no sense either, as percent-encoding is the only encoding available and it applies to the entire value, not just to parts of it.
  • The '' part is for the optional language code - it could be 'en' for English, but effectively nobody cares about that.
  • The rest is trivial: each byte of a character's UTF-8 sequence is percent-encoded. A space must be percent-encoded, too (as %20).
  • The correct charset name is UTF-8, while utf8 is wrong. Don't rely on that unofficial alias being accepted, although it is tolerated every now and then.
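
As a rough illustration of the points above, the following is a minimal sketch of how such an extended value could be built with plain Java. The class and method names (ExtendedFilename, encodeValue, contentDisposition) are made up for this example and are not part of JavaMail; the set of characters left unencoded is the attr-char set from RFC 8187, and the language part is simply left empty.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a JavaMail API: builds filename*=UTF-8''... values.
class ExtendedFilename {
    // Non-alphanumeric characters that RFC 8187 allows unencoded (attr-char).
    private static final String ATTR_CHARS = "!#$&+-.^_`|~";

    // Percent-encodes every UTF-8 byte that is not an attr-char.
    static String encodeValue(String value) {
        StringBuilder sb = new StringBuilder("UTF-8''"); // charset, empty language
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || ATTR_CHARS.indexOf(c) >= 0) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    // Assembles the header value; note: no quotation marks around the value.
    static String contentDisposition(String fileName) {
        return "attachment; filename*=" + encodeValue(fileName);
    }

    public static void main(String[] args) {
        // \u00FC is the letter u with diaeresis, as in the question's filename.
        System.out.println(contentDisposition("hinzugef\u00FCgte.pdf"));
    }
}
```

How the resulting value gets into the actual message depends on the mail library in use; this sketch only shows the encoding step itself.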

In other words: the client acted correctly. If I use Thunderbird 68 and either hit CTRL+U to see an e-mail's source, or save an e-mail as an .EML file and then look into that file, I have a multipart message where each attachment has the headers

Content-Disposition: inline;
    filename*=utf-8''L%20%2D%20qualita%CC%88t.pdf
Content-Type: application/pdf;
    x-unix-mode=0644;
    name="=?utf-8?Q?L_-_qualita=CC=88t=2Epdf?="

Don't be confused by seeing both variants: they have different purposes and different contexts. What you primarily want is filename (although it can't hurt to also provide name). If you look closely, the two values are also written differently: the former percent-encodes each space as %20, while the latter writes an underscore, which is how RFC 2047 Q-encoding represents a space. The UTF-8 byte sequence %CC%88 or =CC=88 is the code point U+0308 = ̈ COMBINING DIAERESIS (turning the preceding a into ä).

This answer explains how differently HTTP browsers treated RFC 5987 in the year 2011.
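
To make the byte sequence concrete, here is a small sketch that decodes the filename* value from the headers above and then composes the a plus combining diaeresis into a single ä. The class and method names are invented for the example. It uses java.net.URLDecoder as a shortcut, which is an assumption that only holds for values without a literal +, because URLDecoder would turn + into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

// Hypothetical helper for illustration only.
class DecodeExtendedFilename {
    // Splits charset'language'value and percent-decodes the value part.
    static String decode(String extValue) {
        String[] parts = extValue.split("'", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not an extended value: " + extValue);
        }
        try {
            // parts[0] is the charset, parts[1] the (often empty) language.
            return URLDecoder.decode(parts[2], parts[0]);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + parts[0], e);
        }
    }

    // Composes base letter + combining mark into one code point where possible.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decoded = decode("utf-8''L%20%2D%20qualita%CC%88t.pdf");
        System.out.println(decoded);        // a followed by U+0308
        System.out.println(toNfc(decoded)); // single precomposed U+00E4
    }
}
```

Both printed strings look the same on screen, but the first contains one more code point than the second.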

AmigoJack