C# Character encoding issues

Question

I'm trying to make a transport agent that parses the body of an email to find the pertinent pieces of information and replaces the generic subject line with specific details. My problem is that a subject line that should read like `ABC Co Error: No status reason code (0123456)` instead shows up as `A B C C o E r r o r : N o s t a t u s r e a s o n c o d e ( 0 1 2 3 4 5 6 )`.

The email is plain text and encoded in us-ascii according to the email header. It is my understanding from this question and this question that C# uses UTF-16 as the default string encoding. The spaces between each character lead me to believe that my code is somehow implicitly converting the ASCII to UTF-16, but I don't know where this would be happening. Any ideas on how to make this work properly?

    void OnSubmittedMessageHandler(SubmittedMessageEventSource source, QueuedMessageEventArgs args)
    {
        this.mailItem = args.MailItem;
        for (int intCounter = this.mailItem.Recipients.Count - 1; intCounter >= 0; intCounter--)
        {
            // Check if the email was sent to automated@mydomain.com
            string msgRecipientP1 = this.mailItem.Recipients[intCounter].Address.LocalPart;
            if (msgRecipientP1.ToLower() == "automated")
            {
                // Read the body of the email
                string line = "";
                Dictionary<string, string> EDIErrors = new Dictionary<string, string>();
                Body body = this.mailItem.Message.Body;
                Stream originalBodyContent = body.GetContentReadStream();
                StreamReader streamReader = new StreamReader(originalBodyContent, System.Text.Encoding.ASCII, true);
                while ((line = streamReader.ReadLine()) != null)
                {
                    if (line.IndexOf("Partner:") > 0)
                    {
                        line.Replace(": ", ":");
                        string[] lineParts = line.Split(new[] { "  " }, StringSplitOptions.None);
                        foreach (string EDIErrorPart in lineParts)
                        {
                            int idx = EDIErrorPart.IndexOf(':');
                            int qidx = EDIErrorPart.IndexOf('"');
                            if (idx > 0)
                            {
                                EDIErrors[EDIErrorPart.Substring(0, idx).ToLower()] = EDIErrorPart.Substring(idx + 1).ToLower();
                            }
                            else if (qidx > 0)
                            {
                                EDIErrors["Message"] = EDIErrorPart.Replace("\"", string.Empty);
                            }
                        }
                    }
                }
                if (originalBodyContent != null)
                {
                    originalBodyContent.Close();
                }

                // Build the new Subject line and the recipient groups
                string sOrder;
                string sMessage;
                string sDistroGroup;
                EDIErrors.TryGetValue("Order", out sOrder);
                EDIErrors.TryGetValue("Message", out sMessage);
                EDIErrors.TryGetValue("Partner", out sDistroGroup);
                string NewSubject = sPartner + " Error: " + sMessage + "(" + sOrder + ")";

                this.mailItem.Message.Subject = NewSubject;
                if (IsTicketable)
                {
                    this.mailItem.Recipients.Add(new RoutingAddress("helpdesk@mydomain.com"));
                }
            }
        }
        return;
    }
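
The symptom can be reproduced in isolation. The following is a minimal sketch, assuming the body bytes are really UTF-16LE rather than ASCII (only an assumption at this point in the question). `EncodingMismatchDemo` and its method are hypothetical names and are not part of the transport agent. In UTF-16LE each Basic Latin character is followed by a zero byte, and an ASCII decoder turns that byte into a NUL character, which may be displayed as a gap between letters.

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Encodes the text as UTF-16LE (no byte order mark), then decodes
    // those same bytes as ASCII to imitate a reader using the wrong charset.
    public static string DecodeUtf16BytesAsAscii(string text)
    {
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        return Encoding.ASCII.GetString(utf16Bytes);
    }
}
```

For the input `"ABC"` the result has six characters: each letter is followed by `'\0'`.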

Are you able to step through the code to identify where the problem occurs? I'm curious whether, if you set a breakpoint inside the while loop, `line` already has the extra spaces or whether they're added when setting the new subject — aleroy, Jul 08 '19 at 22:12
"C# uses UTF-16 as the default string encoding": Yes, it is the only `string` and `char` character encoding. The key point is that string and char are basically in-memory representations of Unicode text. When you read a stream or file, your code needs to decode the bytes using the character encoding they were written with. Check the MIME type related to `originalBodyContent`. — Tom Blodget, Jul 09 '19 at 02:41
Unrelated to your question, but I thought it worth pointing out that this line: `line.Replace(": ", ":");` is doing nothing (well, it is doing some work, but you are dropping the result on the floor). In C# strings are immutable. [String.Replace](https://learn.microsoft.com/en-us/dotnet/api/system.string.replace?view=netframework-4.8#System_String_Replace_System_String_System_String_) returns a new string; it does not change the string it is operating on. — pstrjds, Jul 09 '19 at 11:12
Good catch @pstrjds : @TomBlodget, the incoming email is plain text. The headers show `Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit` — Tim, Jul 09 '19 at 12:49
Where does the `originalBodyContent` come from? Is that some global that you are setting somewhere else? Since you have the message in your hand already via the `MailItem`, why not just grab the `Body` property from the `MailItem.Message` and perform your operation on that? I would think then you wouldn't run into the encoding issues. — pstrjds, Jul 09 '19 at 16:13
Sorry!! While making my minimal example, I cut too much other code, including the lines where I assign `Stream originalBodyContent = this.mailItem.Message.Body.GetContentReadStream();` — Tim, Jul 09 '19 at 16:39
Okay. It has been a while since I messed with the Exchange API and I had forgotten that the `Body` is not just a string, but a different object and that you have to read the stream to grab it. My bad, but thanks for updating the question. — pstrjds, Jul 09 '19 at 18:21
From the "spaces" between each Basic Latin character, it looks like your message body is encoded with UTF-16LE (aka `Encoding.Unicode`). What does [`body.CharsetName`](https://learn.microsoft.com/en-us/previous-versions/office/exchange-server-api/aa566970(v=exchg.150)) say? — Tom Blodget, Jul 10 '19 at 01:00
@TomBlodget Good call. I assumed that since the email is being sent with a US ASCII charset header, it was an ascii body. It is not. `body.CharsetName` returns `utf-16`. I guess I know where the conversion is taking place: in the Exchange server before I even touch it. — Tim, Jul 12 '19 at 19:39
Well dang. It also returns `utf-8`. I had a second email come in from another system while I had my transport agent running. I'm thinking that a PS script using EWS is a better option. — Tim, Jul 12 '19 at 21:14
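
Since `body.CharsetName` reports different values for different messages, one option is to pick the decoder from that value instead of hard-coding `Encoding.ASCII`. The following is a minimal sketch, assuming the reported charset name is one that `Encoding.GetEncoding` recognizes (it throws `ArgumentException` otherwise). `BodyDecoding.ReadLines` is a hypothetical helper and not part of the Exchange API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class BodyDecoding
{
    // Hypothetical helper: reads every line from the stream, decoding the
    // bytes with the encoding named by charsetName (e.g. "utf-16", "utf-8").
    public static string[] ReadLines(Stream content, string charsetName)
    {
        Encoding encoding = Encoding.GetEncoding(charsetName);
        var lines = new List<string>();

        // Disposing the reader also closes the underlying stream.
        using (var reader = new StreamReader(content, encoding, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }
}
```

In the agent from the question this might be called as `BodyDecoding.ReadLines(body.GetContentReadStream(), body.CharsetName)`. That call is untested against Exchange and is shown only to illustrate where the charset name would come from.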
