4

I have the following string:

=?utf-8?Q?=5Bproconact_=2D_Verbesserung_=23=32=37=39=5D_=28Neu=29_Stellvertretungen_Benutzerrecht_=2D_andere_k=C3=B6nnen_f=C3=BCr_andere_Stellvertretungen_erstellen_=C3=A4ndern_usw=2E_dadurch_ist_der_Schutz_der_Aktivi=C3=A4ten_Mails_nicht_gew=C3=A4hrt=...

which is an encoding of

[proconact-Verbesserung #279] (Neu) Stellvertretungen Benutzerrecht - andere können für andere Stellvertretungen erstellen ändern usw. dadurch ist der Schutz der Aktiviäten Mails nicht gewährt.

I am looking for a way to decode the quoted string.

I have tried:

private static string DecodeQuotedPrintables(string input, string charSet) {
    Encoding enc = new ASCIIEncoding();
    try {
        enc = Encoding.GetEncoding(charSet);
    } catch {
        enc = new UTF8Encoding();
    }

    var occurences = new Regex(@"(=[0-9A-Z]{2}){1,}", RegexOptions.Multiline);
    var matches = occurences.Matches(input);

    foreach (Match match in matches) {
        try {
            byte[] b = new byte[match.Groups[0].Value.Length / 3];
            for (int i = 0; i < match.Groups[0].Value.Length / 3; i++) {
                b[i] = byte.Parse(match.Groups[0].Value.Substring(i * 3 + 1, 2), System.Globalization.NumberStyles.AllowHexSpecifier);
            }
            char[] hexChar = enc.GetChars(b);
            input = input.Replace(match.Groups[0].Value, hexChar[0].ToString());
        } catch { ;}
    }
    input = input.Replace("?=", "").Replace("=\r\n", "");

    return input;
}

When I call it (where s is my string):

var x = DecodeQuotedPrintables(s, "utf-8");

it returns:

=?utf-8?Q?[proconact_-_Verbesserung_#_(Neu)_Stellvertretungen_Benutzerrecht_-_andere_können_für_andere_Stellvertretungen_erstellen_ändern_usw._dadurch_ist_der_Schutz_der_Aktiviäten_Mails_nicht_gewährt=...

How can I also remove the `_` characters, the leading `=?utf-8?Q?`, and the trailing `=...`?

BennoDual
  • This is evil: `try { ... } catch { ;}` – Mark Byers May 05 '12 at 08:00
  • What should you end up with? What is the final string that you're trying to extract from the original one? – balexandre May 05 '12 at 08:07
  • This is the original string which I should get: [proconact-Verbesserung #279] (Neu) Stellvertretungen Benutzerrecht - andere können für andere Stellvertretungen erstellen ändern usw. dadurch ist der Schutz der Aktiviäten Mails nicht gewährt. – BennoDual May 05 '12 at 08:11
  • Just a side note: Your source string looks like an a***ed up url-encoded string which could be easily decoded if it had not been mutilated by replacing url-encoded entities like `%23` with `_=23_`. If you cannot control the source string maybe un-replacing the source string and url-decoding it will simplify your method a great deal. – Filburt May 05 '12 at 08:32
  • @Filburt: The source string is a valid (almost) RFC 2047 encoded word; see my answer below. – Douglas May 05 '12 at 09:24
  • @Douglas Didn't know about that one. Good answer with every detail to solve this issue. +1 – Filburt May 05 '12 at 14:24

5 Answers

5

The text you’re trying to decode is typically found in MIME headers, and is encoded according to the specification defined in the following Internet standard: RFC 2047: MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text.

There is a sample implementation for such a decoder on GitHub; maybe you can draw some ideas from it: RFC2047 decoder in C#.

You can also use this online tool for comparing your results: Online MIME Headers Decoder.

Note that your sample text is incorrect. The specification declares:

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

Per the specification, any encoded word must end in ?=. Thus, your sample must be corrected from:

=?utf-8?Q?=5Bproconact_=2D_Verbesserung_=23=32=37=39=5D_=28Neu=29_Stellvertretungen_Benutzerrecht_=2D_andere_k=C3=B6nnen_f=C3=BCr_andere_Stellvertretungen_erstellen_=C3=A4ndern_usw=2E_dadurch_ist_der_Schutz_der_Aktivi=C3=A4ten_Mails_nicht_gew=C3=A4hrt=

…to (scroll to the far right):

=?utf-8?Q?=5Bproconact_=2D_Verbesserung_=23=32=37=39=5D_=28Neu=29_Stellvertretungen_Benutzerrecht_=2D_andere_k=C3=B6nnen_f=C3=BCr_andere_Stellvertretungen_erstellen_=C3=A4ndern_usw=2E_dadurch_ist_der_Schutz_der_Aktivi=C3=A4ten_Mails_nicht_gew=C3=A4hrt?=

Strictly speaking, your sample is also invalid because it exceeds the 75-character limit imposed on any encoded word; however, most decoders tend to be tolerant of this non-conformity.
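
The encoded-word grammar above can be turned into a small decoder. The following is only a minimal sketch, not the GitHub implementation mentioned earlier: the names `EncodedWordSketch`, `Decode`, and `DecodeQ` are made up for illustration. It assumes the header contains well-formed encoded words (terminated by `?=`), it does not enforce the 75-character limit, it does not remove the whitespace that separates adjacent encoded words, and an unknown charset name makes `Encoding.GetEncoding` throw.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodedWordSketch
{
    // encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
    private static readonly Regex EncodedWord = new Regex(
        @"=\?(?<charset>[^?]+)\?(?<encoding>[QqBb])\?(?<text>[^?]*)\?=");

    // Replaces every encoded word in the header with its decoded text;
    // anything that does not match the grammar is left unchanged.
    public static string Decode(string header)
    {
        return EncodedWord.Replace(header, m =>
        {
            Encoding enc = Encoding.GetEncoding(m.Groups["charset"].Value);
            string text = m.Groups["text"].Value;
            bool isBase64 = m.Groups["encoding"].Value.ToUpperInvariant() == "B";
            byte[] bytes = isBase64 ? Convert.FromBase64String(text) : DecodeQ(text);
            return enc.GetString(bytes);
        });
    }

    // "Q" encoding: "_" stands for a space, "=XX" for the byte with hex value XX,
    // and every other (ASCII) character stands for itself.
    private static byte[] DecodeQ(string text)
    {
        var bytes = new List<byte>(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            byte value;
            if (c == '_')
            {
                bytes.Add(0x20);
            }
            else if (c == '=' && i + 2 < text.Length &&
                     byte.TryParse(text.Substring(i + 1, 2), NumberStyles.AllowHexSpecifier,
                                   CultureInfo.InvariantCulture, out value))
            {
                bytes.Add(value);
                i += 2; // skip the two hex digits just consumed
            }
            else
            {
                bytes.Add((byte)c); // assumes the encoded text is plain ASCII
            }
        }
        return bytes.ToArray();
    }
}
```

For example, `EncodedWordSketch.Decode("=?utf-8?Q?k=C3=B6nnen_f=C3=BCr?=")` returns `können für`. Collecting all bytes first and decoding them in one call is what keeps multi-byte UTF-8 sequences such as `=C3=B6` intact.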

Douglas
3

I've tested more than five code snippets, and this is the one that works. I modified the regex part:

Test line:

    im sistemlerimizde bak=FDm =E7al=FD=FEmas=FD yap=FDlaca=F0=FDndan; www.gib.=

Sample call:

    string encoding = "windows-1254";
    string input = "im sistemlerimizde bak=FDm =E7al=FD=FEmas=FD yap=FDlaca=F0=FDndan; www.gib.=";
    DecodeQuotedPrintables(input, encoding);

Code snippet:

    // Requires: using System.Text; using System.Text.RegularExpressions;
    private static string DecodeQuotedPrintables(string input, string charSet)
    {
        Encoding enc;

        try
        {
            enc = Encoding.GetEncoding(charSet);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: fall back to UTF-8.
            enc = new UTF8Encoding();
        }

        // Match each run of =XX escapes (XX is uppercase hexadecimal) and decode the
        // whole run at once, so multi-byte sequences such as =C3=B6 stay together.
        var occurrences = new Regex("(=[0-9A-F]{2})+");
        input = occurrences.Replace(input, match =>
        {
            byte[] b = new byte[match.Value.Length / 3];
            for (int i = 0; i < b.Length; i++)
            {
                b[i] = byte.Parse(match.Value.Substring(i * 3 + 1, 2), System.Globalization.NumberStyles.AllowHexSpecifier);
            }
            return enc.GetString(b);
        });

        // Drop the encoded-word terminator and soft line breaks.
        return input.Replace("?=", "").Replace("=\r\n", "");
    }

Nime Cloud
3

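A standard .NET class already exists for this purpose: create a `System.Net.Mail.Attachment` with the encoded string as its name, then read the decoded text from its `Name` property.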

string unicodeString = "=?UTF-8?Q?YourText?=";
System.Net.Mail.Attachment attachment = System.Net.Mail.Attachment.CreateAttachmentFromString("", unicodeString);
Console.WriteLine(attachment.Name);

hmfarimani
1

Following my comment I'd suggest

private static string MessedUpUrlDecode(string input, string charSet)
{
    Encoding enc;

    try
    {
        enc = Encoding.GetEncoding(charSet);
    }
    catch
    {
        // Unknown charset name: fall back to UTF-8.
        enc = new UTF8Encoding();
    }

    // Take the payload after "=?charset?Q?", then turn the "=XX" escapes into "%XX".
    string messedup = input.Split('?')[3];
    string cleaned = messedup.Replace("_", " ").Replace("=...", ".").Replace("=", "%");

    return System.Web.HttpUtility.UrlDecode(cleaned, enc);
}

assuming that the mutilation of the source strings is consistent.

Filburt
-1

I am not sure how to remove the

=?utf-8?Q?

prefix unless it always appears. If it does, you can do this:

input = input.Split('?')[3];

To get rid of the trailing '=' you can remove it by:

input = input.Remove(input.Length - 1);

You can get rid of the '_' by replacing it with a space:

input = input.Replace("_", " ");

You can use those pieces of code in your DecodeQuotedPrintables function.

Hope this helps!

matthewr