
I am trying to read an email from POP3 and change to the correct encoding when I find the charset in the headers.

I use a TCP Client to connect to the POP3 server.

Below is my code:

    public string ReadToEnd(POP3Client pop3client, out System.Text.Encoding messageEncoding)
    {
        messageEncoding = TCPStream.CurrentEncoding;
        if (EOF)
            return ("");

        System.Text.StringBuilder sb = new System.Text.StringBuilder(m_bytetotal * 2);
        string st = "";
        string tmp;

        do
        {
            tmp = TCPStream.ReadLine();
            if (tmp == ".")
                EOF = true;
            else
                sb.Append(tmp + "\r\n");

            //st += tmp + "\r\n";

            m_byteread += tmp.Length + 2; // CRLF discarded by read

            FireReceived();

            if (tmp.ToLower().Contains("content-type:") && tmp.ToLower().Contains("charset="))
            {
                try
                {
                    string charSetFound = tmp.Substring(tmp.IndexOf("charset=") + "charset=".Length).Replace("\"", "").Replace(";", "");
                    var realEnc = System.Text.Encoding.GetEncoding(charSetFound);

                    if (realEnc != TCPStream.CurrentEncoding)
                    {
                        TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
                    }
                }
                catch { }
            }                
        } while (!EOF);

        messageEncoding = TCPStream.CurrentEncoding;

        return (sb.ToString());
    }

If I remove this line:

    TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);

Everything works fine, except that when the e-mail contains characters from a different charset I get question marks, because the initial encoding is ASCII.

Any suggestions on how to change the encoding while reading data from the Network Stream?

Ehsan Sajjad
net_L
  • why don't you try to decode everything as utf8? TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), System.Text.Encoding.UTF8); – pedrommuller Mar 20 '14 at 13:27
  • As per RFC 2045, section 5.2 ("Content-Type Defaults"): "Default RFC 822 messages without a MIME Content-Type header are taken by this protocol to be plain text in the US-ASCII character set, which can be explicitly specified as: Content-type: text/plain; charset=us-ascii". http://www.ietf.org/rfc/rfc2045.txt – net_L Mar 20 '14 at 14:38
  • By the way, when I tried to read everything as UTF-8, I encountered some symbols instead of characters when the charset was `charset=iso-8859-7` – net_L Mar 20 '14 at 14:45

2 Answers


You're doing it wrong (tm).

Seriously, though, you are going about trying to solve this problem in completely the wrong way. Don't use a StreamReader for this. And especially don't read 1 byte at a time (as you said you needed to do in a comment on an earlier "solution").

For an explanation of why not to use a StreamReader, besides the obvious "because it isn't designed to switch between encodings during the process of reading", feel free to read over another answer I gave about the inefficiencies of using a StreamReader here: Reading an mbox file in C#

What you need to do is buffer your reads (a 4k buffer should be fine). Then, as you already have to do anyway, scan for the '\n' byte to extract content line by line, combining header lines that were folded.
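A minimal sketch of that buffered approach, assuming a hypothetical `RawLineReader` helper (none of these names come from the original code or from any library). It reads 4 KB at a time, yields each line as raw bytes so that no charset decision is made yet, and joins folded header lines. Because the iterator is lazy, a POP3 caller can stop enumerating as soon as it sees the lone "." line, so no further blocking `Read` is issued after the end of the message.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RawLineReader
{
    // Reads the stream in 4 KB chunks and yields each line as raw bytes,
    // without the trailing CRLF (or bare LF). Nothing is decoded here.
    public static IEnumerable<byte[]> ReadLines(Stream stream)
    {
        var buffer = new byte[4096];
        var line = new MemoryStream();
        int count;

        while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < count; i++)
            {
                if (buffer[i] != (byte)'\n')
                {
                    line.WriteByte(buffer[i]);
                    continue;
                }

                // End of line: drop the '\r' of a CRLF pair, then copy the bytes out.
                int length = (int)line.Length;
                if (length > 0 && line.GetBuffer()[length - 1] == (byte)'\r')
                    length--;

                var bytes = new byte[length];
                Array.Copy(line.GetBuffer(), bytes, length);
                line.SetLength(0);
                yield return bytes;
            }
        }

        // Final line without a terminator, if any.
        if (line.Length > 0)
            yield return line.ToArray();
    }

    // Joins folded header lines: a line that starts with a space or tab
    // continues the previous header. Stops at the blank line ending the headers.
    public static List<byte[]> UnfoldHeaders(IEnumerable<byte[]> lines)
    {
        var headers = new List<byte[]>();
        foreach (byte[] raw in lines)
        {
            if (raw.Length == 0)
                break;

            bool continuation = headers.Count > 0
                && (raw[0] == (byte)' ' || raw[0] == (byte)'\t');

            if (continuation)
            {
                byte[] previous = headers[headers.Count - 1];
                var joined = new byte[previous.Length + raw.Length];
                Buffer.BlockCopy(previous, 0, joined, 0, previous.Length);
                Buffer.BlockCopy(raw, 0, joined, previous.Length, raw.Length);
                headers[headers.Count - 1] = joined;
            }
            else
            {
                headers.Add(raw);
            }
        }
        return headers;
    }
}
```

Keeping lines as `byte[]` is the point of the design: each header or body part can then be decoded with whatever charset turns out to apply to it, instead of committing to one encoding for the whole stream.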

Each header may have multiple encoded-word tokens, each of which may be in a different charset, assuming they are properly encoded. Otherwise, you'll have to deal with undeclared 8-bit data and try to massage that into Unicode somehow (probably by having a set of fallback charsets). I'd recommend trying UTF-8 first, followed by a selection of charsets that the user of your library has provided, before finally trying iso-8859-1. Make sure not to try iso-8859-1 until you've tried everything else, because any sequence of 8-bit text will convert to Unicode without error using the iso-8859-1 character encoding.
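The fallback order described above could be sketched roughly as follows. `FallbackDecoder` and its parameter names are hypothetical. The sketch relies on `Encoding.GetEncoding` with `DecoderFallback.ExceptionFallback` so that invalid byte sequences throw instead of silently becoming '?'. Note that on .NET Core and later, legacy code pages such as iso-8859-7 are only available after registering `CodePagesEncodingProvider`; the sketch simply skips any charset name the runtime does not know.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FallbackDecoder
{
    // Tries strict UTF-8 first, then each caller-supplied charset,
    // and only then iso-8859-1, which accepts every byte value.
    public static string Decode(byte[] raw, IEnumerable<string> userCharsets)
    {
        var candidates = new List<string> { "utf-8" };
        candidates.AddRange(userCharsets);

        foreach (string name in candidates)
        {
            try
            {
                Encoding strict = Encoding.GetEncoding(
                    name,
                    EncoderFallback.ExceptionFallback,
                    DecoderFallback.ExceptionFallback);
                return strict.GetString(raw);
            }
            catch (DecoderFallbackException)
            {
                // The bytes are not valid in this charset; try the next one.
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name; try the next one.
            }
        }

        // Last resort: every byte maps to a character, so this cannot fail.
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }
}
```

Most single-byte charsets reject few or no byte values, so the first single-byte candidate in the list tends to win; the order of the caller-supplied charsets therefore matters.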

When you get to the text content of the message, you'll want to check the Content-Type header for a charset parameter. If no charset parameter is defined, it should be US-ASCII, but in practice it could be anything. Even if the charset is defined, it might not match the actual character encoding used in the text body of the message, so once again you'll probably want to have a set of fallbacks.
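As an illustration of the charset lookup, here is a small sketch that pulls the charset parameter out of an already unfolded Content-Type header line and falls back to us-ascii when none is present. `ContentTypeCharset` is a hypothetical name, and the regular expression is a simplification rather than a full RFC 2045 parameter parser. Unlike a case-sensitive `IndexOf("charset=")`, it matches the parameter name in any case and stops at the next `;`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContentTypeCharset
{
    // Matches charset=value or charset="value", ignoring case.
    private static readonly Regex CharsetParam = new Regex(
        @"charset\s*=\s*(?:""(?<v>[^""]*)""|(?<v>[^;\s]+))",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the declared charset, or "us-ascii" (the RFC 2045 default)
    // when the header carries no charset parameter.
    public static string GetCharset(string contentTypeHeader)
    {
        Match m = CharsetParam.Match(contentTypeHeader ?? string.Empty);
        return m.Success ? m.Groups["v"].Value : "us-ascii";
    }
}
```

The returned name is only the declared charset; as noted above, the body may still fail to decode with it, in which case a fallback list is needed.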

As you've probably guessed by this point, this is very clearly not a trivial task as it requires the parser to do on-the-fly character conversion as it goes (and the character conversion requires internal parser state about what the expected charset is at any given time).

Since I've already done the work, you should really consider using MimeKit which will parse the email and properly do charset conversion on the headers and the content using the appropriate charset encoding.

I've also written a Pop3Client class that is included in my MailKit library.

If your goal is to learn and write your own library, I'd still highly recommend reading over my code because it is highly efficient and does things in a proper way.

Community
jstedfast
  • Self promotion is fine, but you really need to provide some details on how to do it "the right way" here on this site. If you gave a basic summary of how to do it the right way then said "*I have done all this work for you already in my library ...*" that would be fine. But just saying "*You are not doing it right, just use my library*" is a borderline answer (and depending on who you ask it may be borderline acceptable or borderline unacceptable) – Scott Chamberlain Mar 31 '14 at 20:52
  • I figured it was obvious why using a StreamReader for a "text" stream that could change charset multiple times was not an ideal solution (I mean, in order to do it that way, he has to read 1 byte at a time which is extremely inefficient). – jstedfast Mar 31 '14 at 20:55
  • The new version is much better. – Scott Chamberlain Mar 31 '14 at 23:06
0

One way to detect the encoding is to look at the Byte Order Mark (BOM), which is the first few bytes of the stream. If present, it tells you the encoding. However, the stream might not have a BOM, and in that case it could be ASCII, UTF-8 without a BOM, or something else.
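For illustration, a BOM check over the first bytes of a buffer might look like the sketch below. `BomSniffer` is a hypothetical name, and the method returns null when no BOM is found, which is the common case for e-mail bodies.

```csharp
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null if there is none.
    public static Encoding DetectFromBom(byte[] data)
    {
        // UTF-8: EF BB BF
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;

        // UTF-32 LE: FF FE 00 00 (checked before UTF-16 LE, which shares FF FE)
        if (data.Length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return Encoding.UTF32;

        // UTF-16 LE: FF FE
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;

        // UTF-16 BE: FE FF
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;

        return null;
    }
}
```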

You can convert your stream from one encoding to another with the Encoding Class:

    Encoding textEncoding = Encoding.[your detected encoding here];
    byte[] converted = Encoding.UTF8.GetBytes(textEncoding.GetString(TCPStream.GetBuffer()));

You may select your preferred encoding when converting.

Hope it answers your question.

edit
You may use this code to read your stream in blocks.

    // Copy the source stream into memory in 1 KB blocks.
    MemoryStream st = new MemoryStream();
    byte[] bytes = new byte[1024];
    int reads;
    while ((reads = yourStream.Read(bytes, 0, bytes.Length)) > 0)
    {
        // Read may return fewer bytes than requested; write only what was read.
        st.Write(bytes, 0, reads);
    }
Stephen Jennings
Ricardo Appleton
  • The TCPStream is a StreamReader which has a NetworkStream as a BaseStream through a TCPClient. Only MemoryStream has a GetBuffer method which I cannot use in my case, or I don't know how to do so. – net_L Mar 20 '14 at 14:22
  • You could read your stream into a byte array by calling TCPStream.Read(....). You'd declare `byte[] stBytes = new byte[TCPStream.length];` Then you may read your bytes into a MemoryStream – Ricardo Appleton Mar 20 '14 at 14:52
  • With how many bytes? I have to read the NetworkStream per line, and when I encounter the '.' that means it's the end of the e-mail message. I will try to do so, and when I get the line feed character I will check the string, and will report back. Thank you. `byte[] stBytes = new byte[TCPStream.length];` TCPStream.Length will throw an exception as it reads from a NetworkStream. [link](http://msdn.microsoft.com/en-us/library/system.net.sockets.networkstream.length(v=vs.110).aspx) – net_L Mar 20 '14 at 14:56
  • I'll post a piece of code I've got that reads a stream into an array in blocks of 1k. You could then adjust it to your needs – Ricardo Appleton Mar 20 '14 at 15:22
  • I managed to make it work by reading only 1 byte at a time, because I need to check for the line feed. Thank you for your guidance, it solved my problem. – net_L Mar 20 '14 at 16:24