5

I'm trying to figure out a way to parse out a base64 string from within a larger string.

I have the string "Hello <base64 content> World" and I want to parse out the base64 content and convert it back to a string, giving "Hello Awesome World".

Answers in C# preferred.

Edit: Updated with a more realistic example.

--abcdef
\n
Content-Type: Text/Plain;
Content-Transfer-Encoding: base64
\n
<base64 content>
\n
--abcdef--

This is taken from one sample. The problem is that the Content.... lines vary quite a bit from one record to the next.

Adam
  • 1
    Is the base64 content delimited in any way? – jball Oct 04 '10 at 18:22
  • 1
    This is an XY problem. The real problem is X: how did you end up with a string like that. – Hans Passant Oct 04 '10 at 19:18
  • @Hans Passant I agree. I am trying to write a tool to fix some data that was somehow corrupted in the first place. We have already fixed the part that produced the corrupted data, but now we have to fix approximately 3 million records. – Adam Oct 04 '10 at 20:39
  • Well, document the bug. Only if we know how it screwed up will we have a guess at how to unscrew it. We need to know X. – Hans Passant Oct 04 '10 at 20:41
  • @jball It appears there may actually be a delimiter. Wrapping all of the base64 and non-base64 content is a "\n" and wrapping the base64 part within the non-base64 part appears to be another "\n". I'll update the example above. – Adam Oct 04 '10 at 20:44
  • @Hans Passant Unfortunately I don't know the "how" part. We were using some 3rd party tool in conjunction with some custom MIME string parsing, done by "inspection and guesswork" by a previous developer, to save emails from Lotus Notes into a document management system. That setup has since been removed and replaced with Redemption and Outlook (solving our problem). When I try to use the same tool that saved the emails in the first place, the content always comes back as an empty string even though I can see the actual message contents in the debugger. – Adam Oct 04 '10 at 20:53
  • Does `"Content-Transfer-Encoding: base64"` always precede the base64 content and never occur immediately before plain text? Also, is the base64 content free of `\n`s? – jball Oct 04 '10 at 21:14
  • @jball The base64 content never contains any \n's, but sometimes the Content-Transfer-Encoding does not immediately follow the \n of the base64 content. One example I found has "X-NAIMIME-Modified: 1", and I have seen others with other strings. With 3 million records to parse I can't be certain how many would match a specific format, though it may have to be an option to parse for the Content-Transfer-Encoding as a delimiter, wait for more complaints, and fix those as each new delimiter is discovered. – Adam Oct 04 '10 at 21:29
  • I would make the assumption that they match that pattern, and then add other possibilities as you receive complaints. – jball Oct 04 '10 at 21:41

2 Answers

8

There is no reliable way to do it. How would you know that, for instance, "Hello" is not a base64 string? OK, it's a bad example because base64 is supposed to be padded so that the length is a multiple of 4, but what about "overflow"? It's 8 characters long and it is a valid base64 string (it would decode to "¢÷«~Z0"), even though it's obviously a normal word to a human reader. There's just no way you can tell for sure whether a word is a normal word or base64 encoded text.

The fact that you have base64 encoded text embedded in normal text is clearly a design mistake. I suggest you do something about that rather than trying to do something impossible...
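To make the "overflow" point concrete, here is a minimal sketch. The class and method names (`Base64Ambiguity`, `DecodesAsBase64`) are made up for illustration; only `Convert.FromBase64String` is the real framework call.

```csharp
using System;

public static class Base64Ambiguity
{
    // Returns true when the word happens to decode cleanly as base64.
    // A true result says nothing about whether the word was ever
    // meant to be base64 in the first place.
    public static bool DecodesAsBase64(string word, out byte[] bytes)
    {
        try
        {
            bytes = Convert.FromBase64String(word);
            return true;
        }
        catch (FormatException)
        {
            // Wrong length or characters outside the base64 alphabet.
            bytes = new byte[0];
            return false;
        }
    }
}
```

Calling `DecodesAsBase64("overflow", out bytes)` succeeds and yields six bytes, while `DecodesAsBase64("Hello", out bytes)` fails only because the length is not a multiple of 4.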

Thomas Levesque
4

In short form you could:

  • split the string on any chars that are not valid base64 data or padding
  • try to convert each token
  • if the conversion succeeds, call replace on the original string to switch the token with the converted value

In code:

var delimiters = new char[] { /* non-base64 ASCII chars */ };
var possibles = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
//need to tweak to include padding chars in matches, but still split on padding?
//maybe better off creating a regex to match base64 + padding
//and using Regex.Split?

foreach(var match in possibles)
{
    try
    {
        var converted = Convert.FromBase64String(match);
        var text = System.Text.Encoding.UTF8.GetString(converted);
        if(!string.IsNullOrEmpty(text))
        {
            value = value.Replace(match, text);
        }
    } 
    catch (System.ArgumentNullException) 
    {
        // match was null; leave value unchanged
    }
    catch (System.FormatException) 
    {
        // match is not valid base64; leave it in place as plain text
    }
}

Without a delimiter, though, you can end up converting non-base64 text that happens to also be valid as base64 encoded text.

Looking at your example of trying to convert "Hello QXdlc29tZQ== World" to "Hello Awesome World" the above algorithm could easily generate something like "ée¡Ý•ͽµ”¢¹]" by trying to convert the whole string from base64 since there is no delimiter between plain and encoded text.
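The comments in the first snippet wonder whether a regex would be a better way to find candidates. A rough sketch of that idea follows. The pattern and the names `Base64TokenReplacer` and `ReplaceCandidates` are my own assumptions, not something tested against the real data, and it still has the false-positive problem described above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class Base64TokenReplacer
{
    // Hypothetical pattern: a run of base64 characters whose total length
    // is a multiple of 4 (including optional '=' padding) and which is not
    // directly adjacent to any other base64 character.
    private static readonly Regex Candidate = new Regex(
        @"(?<![A-Za-z0-9+/=])(?:[A-Za-z0-9+/]{4})*" +
        @"(?:[A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)" +
        @"(?![A-Za-z0-9+/=])");

    // Replaces every candidate token with its decoded UTF-8 text.
    public static string ReplaceCandidates(string value)
    {
        return Candidate.Replace(value, m =>
        {
            try
            {
                byte[] bytes = Convert.FromBase64String(m.Value);
                return Encoding.UTF8.GetString(bytes);
            }
            catch (FormatException)
            {
                // Not decodable after all; keep the original token.
                return m.Value;
            }
        });
    }
}
```

With the space-delimited input `"Hello QXdlc29tZQ== World"` this returns `"Hello Awesome World"`, because "Hello" and "World" are 5 characters long and never match. Any ordinary word whose length is a multiple of 4, such as "overflow", would still be mangled.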

Update (based on comments):

If there are no '\n's in the base64 content and it is always preceded by "Content-Transfer-Encoding: base64\n", then there is a way:

  • split the string on '\n'
  • iterate over all the tokens until a token ends in "Content-Transfer-Encoding: base64"
  • the next token (if there are any) should be decoded (if possible) and then the replacement should be made in the original string
  • return to iterating until out of tokens

In code:

private string ConvertMixedUpTextAndBase64(string value)
{
    var delimiters = new char[] { '\n' };
    var possibles = value.Split(delimiters, 
                                StringSplitOptions.RemoveEmptyEntries);

    for (int i = 0; i < possibles.Length - 1; i++)
    {
        if (possibles[i].EndsWith("Content-Transfer-Encoding: base64"))
        {
            var nextTokenPlain = DecodeBase64(possibles[i + 1]);
            if (!string.IsNullOrEmpty(nextTokenPlain))
            {
                value = value.Replace(possibles[i + 1], nextTokenPlain);
                i++;
            }
        }                
    }
    return value;
}

private string DecodeBase64(string text)
{
    string result = null;
    try
    {
        var converted = Convert.FromBase64String(text);
        result = System.Text.Encoding.UTF8.GetString(converted);
    }
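    catch (System.ArgumentNullException)
    {
        // text was null; nothing to decode, so result stays null
    }
    catch (System.FormatException)
    {
        // text is not valid base64; result stays null so the caller skips it
    }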
    return result;
}
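Since the comments mention records where another header (for example "X-NAIMIME-Modified: 1") sits between the Content-Transfer-Encoding line and the base64 content, here is a hedged variant of the update above. It assumes the delimiters are real newline characters, that boundary lines start with "--", and that any line which fails to decode is just another header to skip. The names `MimeBase64Fixer` and `DecodeAfterMarker` are hypothetical.

```csharp
using System;
using System.Text;

public static class MimeBase64Fixer
{
    private const string Marker = "Content-Transfer-Encoding: base64";

    public static string DecodeAfterMarker(string value)
    {
        string[] lines = value.Split(new[] { '\n' },
                                     StringSplitOptions.RemoveEmptyEntries);
        bool pending = false; // true once the marker line has been seen

        foreach (string line in lines)
        {
            string trimmed = line.Trim(); // also drops a trailing '\r'
            if (trimmed.Length == 0) continue;

            if (trimmed.EndsWith(Marker, StringComparison.Ordinal))
            {
                pending = true;
            }
            else if (trimmed.StartsWith("--", StringComparison.Ordinal))
            {
                pending = false; // assumed MIME boundary: stop looking
            }
            else if (pending)
            {
                string decoded = TryDecode(trimmed);
                if (!string.IsNullOrEmpty(decoded))
                {
                    value = value.Replace(trimmed, decoded);
                    pending = false;
                }
                // Otherwise assume an extra header line and keep looking.
            }
        }
        return value;
    }

    // Returns the decoded UTF-8 text, or null when text is not valid base64.
    private static string TryDecode(string text)
    {
        try
        {
            return Encoding.UTF8.GetString(Convert.FromBase64String(text));
        }
        catch (FormatException)
        {
            return null;
        }
    }
}
```

Like the original, this uses `string.Replace`, so every occurrence of the encoded token in the record is replaced, not just the one after the marker.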
jball
  • 2
    The last part is the tricky part. For instance, if you split and obtain "aaBG" as your string, what do you do? This is the base64 representation of "i F". You'd need some heuristic to decide which is the one you actually want. – Yuliy Oct 04 '10 at 18:32
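One possible heuristic of the kind mentioned in that comment, sketched under the assumption that the original text was UTF-8 and contained no control characters: reject any token whose decoded bytes are not strictly valid, printable UTF-8. The names `Base64Heuristic` and `LooksLikeEncodedText` are invented for this example, and the check will still accept some coincidental matches.

```csharp
using System;
using System.Text;

public static class Base64Heuristic
{
    // Strict decoder: throws on byte sequences that are not valid UTF-8
    // instead of silently substituting replacement characters.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    public static bool LooksLikeEncodedText(string token, out string decoded)
    {
        decoded = null;
        try
        {
            string text = StrictUtf8.GetString(Convert.FromBase64String(token));
            if (text.Length == 0) return false;

            foreach (char c in text)
            {
                // Control characters other than whitespace suggest binary junk.
                if (char.IsControl(c) && !char.IsWhiteSpace(c)) return false;
            }
            decoded = text;
            return true;
        }
        catch (FormatException)
        {
            return false; // not base64 at all
        }
        catch (DecoderFallbackException)
        {
            return false; // base64, but the bytes are not valid UTF-8
        }
    }
}
```

Under these assumptions "QXdlc29tZQ==" is accepted as "Awesome", while "overflow" and "aaBG" are rejected because their decoded bytes are not valid UTF-8.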