I am experimenting with reading multipart/mixed emails with the GMail API.
The goal is to correctly decode each text/plain part of the multipart/mixed email (there can be many, in different encodings) to a C# string (i.e. UTF-16):

using System;
using System.Linq;
using System.Text;

public static string DecodeTextPart(Google.Apis.Gmail.v1.Data.MessagePart part)
{
    // Header names are case-insensitive, so match "Content-Type" in any casing.
    var content_type_header = part.Headers.FirstOrDefault(h => string.Equals(h.Name, "content-type", StringComparison.OrdinalIgnoreCase));

    if (content_type_header == null)
        throw new ArgumentException("No content-type header found in the email part");

    // Parse the media type and its parameters (charset, name, ...).
    var content_type = new System.Net.Mime.ContentType(content_type_header.Value);

    if (!string.Equals(content_type.MediaType, "text/plain", StringComparison.OrdinalIgnoreCase))
        throw new ArgumentException("The part is not text/plain");

    // Decode the raw bytes using the charset declared for this part.
    return Encoding.GetEncoding(content_type.CharSet).GetString(GetAttachmentBytes(part.Body));
}

GetAttachmentBytes returns raw attachment bytes, without conversion, decoded from the base64url encoding that GMail uses.
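
GetAttachmentBytes itself is not shown here. As a rough sketch of the base64url decoding step it performs, assuming the input is the string found in part.Body.Data, something like the following would do. The class and method names are hypothetical and only for illustration, and a part whose body carries just an AttachmentId would need its data fetched separately first.

```csharp
using System;

public static class Base64UrlSketch
{
    // Hypothetical helper (not from the Gmail client library):
    // decodes a base64url string (RFC 4648, section 5) into raw bytes.
    public static byte[] DecodeBase64Url(string data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // Map the URL-safe alphabet back to the standard base64 alphabet.
        string s = data.Replace('-', '+').Replace('_', '/');

        // base64url input may omit the trailing '=' padding; restore it.
        switch (s.Length % 4)
        {
            case 2: s += "=="; break;
            case 3: s += "="; break;
            case 1: throw new FormatException("Invalid base64url length");
        }

        return Convert.FromBase64String(s);
    }
}
```

The result is just bytes; no charset is applied at this stage, so any encoding mismatch has to come from what the API put into those bytes.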

What I find is that in many cases this produces invalid strings, because the raw bytes that I get for the attachment content appear to always be in UTF-8, even though content-type of that same part declares otherwise.

E.g. given the email:

Date: ...
From: ...
Reply-To: ...
Message-ID: ...
To: ...
Subject: Test 1 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----------0E50FC0802A2FCCAA"

------------0E50FC0802A2FCCAA
Content-Type: text/plain; charset=windows-1251
Content-Transfer-Encoding: 8bit


Content test: Cyrillic, Windows-1251 (à, ÿ, æ)
------------0E50FC0802A2FCCAA
Content-Type: TEXT/PLAIN;
 name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
 filename="Irrelevant.txt"

VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0E50FC0802A2FCCAA--

With this email, I successfully find the first part, and the code above determines, with the help of System.Net.Mime.ContentType, that it is charset=windows-1251. Then .GetString() returns garbage, because the raw bytes returned by GetAttachmentBytes are actually UTF-8, not Windows-1251.

Exactly the same happens with

Subject: Test 2 text file
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----------0B716C1D8123D8710"

------------0B716C1D8123D8710
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 8bit


Content test: Cyrillic, koi-8 (Б, С, Ц)
------------0B716C1D8123D8710
Content-Type: TEXT/PLAIN;
 name="Irrelevant.txt"
Content-transfer-encoding: base64
Content-Disposition: attachment;
 filename="Irrelevant.txt"

VGhpcyBmaWxlIGRvZXMgbm90IGNvbnRhaW4gdXNlZnVsIGluZm9ybWF0aW9u
------------0B716C1D8123D8710--

Note that the three test letters in the parentheses after the encoding name are the same in both emails, and in Unicode look like (а, я, ж), but (correctly) look wrong in the email body representations quoted above due to the different encodings.

If I "fix" the function to always use Encoding.UTF8 instead of GetEncoding(content_type.CharSet), then it appears to work in the tests that I've done so far.

At the same time, the GMail interface displays the letters correctly in both cases, so it must have correctly parsed the incoming emails using the correct declared encodings.

Is it the case that the GMail API re-encodes all text chunks into UTF-8 (wrapped in base64url), but reports the original charset for them?
Am I therefore supposed to always use UTF-8 with the GMail API and disregard content-type's charset=?
Or is there a problem with my code?

GSerg
  • You have an encoding issue. The encoding is based on the language header : . You are using the wrong language, which is why the characters are not being displayed properly, or your machine doesn't have the correct FONT and is substituting a different font. – jdweng Jan 09 '20 at 12:26
  • @jdweng There is no HTML. There is no encoding problem with the quoted emails either. They display correctly in the email clients and in the GMail web interface. – GSerg Jan 09 '20 at 12:40
  • Is your email set to html or text? You said " look wrong in the email body due to different encodings". You have html!!! – jdweng Jan 09 '20 at 12:48
  • @jdweng My email is set to `multipart/mixed` as you can see from the question. It contains two parts, one `text/plain` body and one `text/plain` attachment. There is no HTML. The email bodies look correct in the email clients and in the GMail interface. They look "incorrect" in the quoted representation of the email bodies here on Stack Overflow, which is correct because the encoding of the quoted text block does not match the encoding of this Stack Overflow page. I have clarified this wording. This is not the issue. The issue is that bytes returned from GMail correspond to UTF-8. – GSerg Jan 09 '20 at 12:56
  • Then the encoding is using the culture of your PC as the default. Not sure if you can change that dynamically. I also don't think that you can have an attachment on a text email. – jdweng Jan 09 '20 at 13:03
  • @jdweng This has nothing to do with the default culture of my PC, and it could not cause GMail, which is in the cloud, to return data specifically in UTF-8. I have no problem decoding the message. The problem is that the GMail API reports one encoding but uses another. And you can absolutely have an attachment to a text email. In fact you can have an email that is only an attachment, with no body parts at all. – GSerg Jan 09 '20 at 13:06
  • @GSerg https://meta.stackoverflow.com/questions/392514/what-can-i-do-about-a-user-consistently-spreading-misinformation – CodeCaster Jan 09 '20 at 14:21
  • @CodeCaster I've been wanting to do just that for about six months, just didn't know how to make it sound objective... – GSerg Jan 09 '20 at 16:22

1 Answer

According to these two resources:

The Value is indeed a base-64 encoded representation of the part converted to UTF-8.

This is however not documented by Google, as far as I can find.
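
Since the behaviour is not documented, one cautious option is to decode as strict UTF-8 and only fall back to the declared charset if the bytes turn out not to be valid UTF-8. The following is a minimal sketch of that idea, not an official recipe: the class and method names are hypothetical, and the fallback assumes the declared charset is one that Encoding.GetEncoding can resolve on the runtime in use.

```csharp
using System;
using System.Text;

public static class GmailTextSketch
{
    // Strict UTF-8 decoder: throws on invalid byte sequences instead of
    // silently substituting U+FFFD replacement characters.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: rawBytes are the base64url-decoded body bytes,
    // declaredCharset is the charset parameter from the part's Content-Type.
    public static string DecodeBody(byte[] rawBytes, string declaredCharset)
    {
        if (rawBytes == null) throw new ArgumentNullException(nameof(rawBytes));

        try
        {
            // Assumption based on observed behaviour: the body is already UTF-8.
            return StrictUtf8.GetString(rawBytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8 after all; trust the declared charset if there is one.
            if (string.IsNullOrEmpty(declaredCharset)) throw;
            return Encoding.GetEncoding(declaredCharset).GetString(rawBytes);
        }
    }
}
```

With the emails from the question, the UTF-8 path would be taken and the declared windows-1251 or koi8-r would never be consulted. The fallback only matters if the API ever returns bytes in the original encoding.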

CodeCaster
  • Thank you! Someone has found it apparently, judging by the "According to the API documentation, response is always UTF-8 encoded." from the second link. I cannot find it either though. – GSerg Jan 09 '20 at 16:12