Convert string to ASCII without exceptions (like TryParse)

Question

I am implementing a TryParse() method for an ASCII string class. The method takes a string and converts it to a C-style string (i.e. a null-terminated ASCII string).

I had been using only a Parse(), doing the conversion to ASCII like this:

public static bool Parse(string s, out byte[] result)
{
    result = null;
    if (s == null || s.Length < 1)
        return false;

    byte[] d = new byte[s.Length + 1]; // Add space for null-terminator
    System.Text.Encoding.ASCII.GetBytes(s).CopyTo(d, 0);
    // GetBytes can throw exceptions
    // (so can CopyTo() but I can replace that with a loop)
    result = d;
    return true;
}

However, part of the point of a TryParse is to avoid the overhead of exceptions, and GetBytes() can throw, so I'm looking for an approach that does not.

Maybe there is a TryGetBytes()-like method?

Or maybe we can reason about the expected format of a standard .Net string and perform the change mathematically (I'm not overly familiar with UTF encodings)?

EDIT: I guess for non-ASCII chars in the string, the TryParse() method should return false

EDIT: I expect when I get around to implementing the ToString() method for this class I may need to do the reverse there.

`GetBytes` throws mainly one exception `ArgumentNullException` ,and you can check that easily... — Pikoh, Jul 21 '17 at 11:28
I don't think ASCII conversion ever throw a FallBack exception which (aside from an NRE) is the only thing GetBytes will throw as all single bytes are convertible in a single byte encoding. — Alex K., Jul 21 '17 at 11:28
@Pikoh - indeed I can, but as I say the idea is to remove the overhead of the exceptions (well most of the idea) — Toby, Jul 21 '17 at 11:30
So, check `MyString!=null` before `GetBytes`, and you won't get an exception? I'm not sure i understand — Pikoh, Jul 21 '17 at 11:31
I read the reason for the `TryParse()` benefit over `Parse` was, partly to remove the overhead of the exceptions and the `try... catch`. If I have something that can possibly throw exceptions then I need to `try... catch` it. I'd like to avoid this (not just hide it from the caller). — Toby, Jul 21 '17 at 11:33
It's very unclear what you're actually trying to "parse" here. (This doesn't sound like the normal meaning of the word parse to me.) Please provide a [mcve]. What do you want to happen with any non-ASCII characters? — Jon Skeet, Jul 21 '17 at 11:36
@JonSkeet Updated. Non-ASCII chars should... (forgot to consider them!) ... cause the `TryParse()` to return `false`. — Toby, Jul 21 '17 at 11:43
@Toby: If you want Non-ASCII characters to cause it to fail then using `Encoding.ASCII.GetBytes(s)` will not work for you - this causes non-ascii characters to be replaced by a "?". Jon's manual approach is therefore probably what you want... — Chris, Jul 21 '17 at 11:59

score 2 · Answer 1 · edited Jul 21 '17 at 12:01

There are two possible exceptions that Encoding.GetBytes might throw according to the documentation.

ArgumentNullException is easily avoided. Do a null check on your input and you can ensure this is never thrown.

EncoderFallbackException needs a bit more investigation... Reading the documentation:

A fallback strategy determines how an encoder handles invalid characters or how a decoder handles invalid bytes.

And if we look in the documentation for the ASCII encoding, we see this:

It uses replacement fallback to replace each string that it cannot encode and each byte that it cannot decode with a question mark ("?") character.

That means it doesn't use the Exception Fallback and thus will never throw an EncoderFallbackException.

So in summary if you are using ASCII encoding and ensure you don't pass in a null string then you will never have an exception thrown by the call to GetBytes.
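A minimal sketch of that replacement behaviour, assuming the stock `Encoding.ASCII` instance. The class and method names here are hypothetical and only serve the illustration:

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class AsciiReplacementDemo
{
    // Encodes with the default ASCII encoding. Characters outside the
    // ASCII range are replaced with '?' (0x3F) instead of throwing.
    public static byte[] EncodeLossy(string s)
    {
        if (s == null)
        {
            return null; // avoids the only exception GetBytes would throw here
        }
        return Encoding.ASCII.GetBytes(s);
    }
}
```

For example, `BitConverter.ToString(AsciiReplacementDemo.EncodeLossy("a\u00A3b"))` gives `61-3F-62`: the pound sign is silently turned into a question mark, so the caller cannot tell that data was lost.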

edited Jul 21 '17 at 12:01 by Pikoh · answered Jul 21 '17 at 11:52 by Chris

I edited your answer to make the quotes clearer, hope you don't mind – Pikoh Jul 21 '17 at 12:01
Thank you Chris, this is a dang good answer for my question in its original form. As you commented though, Jon's provides the error-ing on non-ASCII chars, which I had not considered until he asked. – Toby Jul 21 '17 at 12:05
@Toby: It's a fact of life round here. Jon's answers are *always* better. ;-) – Chris Jul 21 '17 at 12:24

score 2 · Accepted Answer · answered Jul 21 '17 at 11:54

Two options:

You could just ignore Encoding entirely, and write the loop yourself:

public static bool TryParse(string s, out byte[] result)
{
    result = null;
    // TODO: It's not clear why you don't want to be able to convert an empty string
    if (s == null || s.Length < 1)
    {
        return false;
    }

    byte[] buffer = new byte[s.Length + 1]; // Add space for null-terminator
    for (int i = 0; i < s.Length; i++)
    {
        char c = s[i];
        if (c > 127)
        {
            return false;
        }
        buffer[i] = (byte) c;
    }
    result = buffer;
    return true;
}

That's simple, but may be slightly slower than using Encoding.GetBytes.

The second option would be to use a custom EncoderFallback:

public static bool TryParse(string s, out byte[] result)
{
    result = null;
    // TODO: It's not clear why you don't want to be able to convert an empty string
    if (s == null || s.Length < 1)
    {
        return false;
    }

    var fallback = new CustomFallback();
    // Encoding.ASCII is read-only, so clone it before assigning the fallback
    var encoding = (Encoding) Encoding.ASCII.Clone();
    encoding.EncoderFallback = fallback;
    byte[] buffer = new byte[s.Length + 1]; // Add space for null-terminator
    // Use overload of Encoding.GetBytes that writes straight into the buffer
    encoding.GetBytes(s, 0, s.Length, buffer, 0);
    if (fallback.HadErrors)
    {
        return false;
    }
    result = buffer;
    return true;
}

That would require writing CustomFallback though - it would need to basically keep track of whether it had ever been asked to handle invalid input.
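One possible shape for such a fallback is sketched below. `CustomFallback` and `HadErrors` are the names used in the snippet above; the nested `TrackingBuffer` class is a hypothetical helper, and the whole thing is an untested illustration rather than a definitive implementation:

```csharp
using System.Text;

// Sketch: a fallback that emits no replacement characters and only
// records that the encoder met a character it could not encode.
public sealed class CustomFallback : EncoderFallback
{
    public bool HadErrors { get; private set; }

    // This fallback never produces replacement characters.
    public override int MaxCharCount
    {
        get { return 0; }
    }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new TrackingBuffer(this);
    }

    // Hypothetical helper name: reports every failure back to its owner.
    private sealed class TrackingBuffer : EncoderFallbackBuffer
    {
        private readonly CustomFallback owner;

        public TrackingBuffer(CustomFallback owner)
        {
            this.owner = owner;
        }

        // Called for a single unencodable char; false means "nothing to emit".
        public override bool Fallback(char charUnknown, int index)
        {
            owner.HadErrors = true;
            return false;
        }

        // Called for an unencodable surrogate pair.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            owner.HadErrors = true;
            return false;
        }

        public override char GetNextChar()
        {
            return '\0'; // no replacement characters are queued
        }

        public override bool MovePrevious()
        {
            return false;
        }

        public override int Remaining
        {
            get { return 0; }
        }
    }
}
```

Because `HadErrors` is never reset, this sketch assumes a fresh `CustomFallback` per call, as in the `TryParse` above.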

If you didn't mind an encoding processing the data twice, you could call Encoding.GetByteCount with a UTF-8-based encoding with a replacement fallback (with a non-ASCII replacement character), and check whether that returns the same number of bytes as the number of chars in the string. If it does, call Encoding.ASCII.GetBytes.
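A rough sketch of that two-pass idea, assuming the stock `Encoding.UTF8` instance is acceptable (its default replacement character, U+FFFD, is non-ASCII and takes three bytes). The class name `AsciiByteCountSketch` is hypothetical:

```csharp
using System.Text;

// Hypothetical container class for the sketch.
public static class AsciiByteCountSketch
{
    public static bool TryParse(string s, out byte[] result)
    {
        result = null;
        if (string.IsNullOrEmpty(s))
        {
            return false;
        }

        // In UTF-8 every ASCII char is exactly one byte and everything else
        // is longer, so equal counts mean the string is pure ASCII.
        if (Encoding.UTF8.GetByteCount(s) != s.Length)
        {
            return false;
        }

        byte[] buffer = new byte[s.Length + 1]; // last byte stays 0 as the terminator
        Encoding.ASCII.GetBytes(s, 0, s.Length, buffer, 0);
        result = buffer;
        return true;
    }
}
```

The string is scanned twice, once for the count and once for the conversion, which is the cost mentioned above.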

Personally I'd go for the first option unless you have reason to believe it's too slow.

Thanks Jon, as I say I hadn't considered what should happen when non-ASCII characters are encountered until you mentioned it. Which I'm sure I would otherwise have regretted down the line! — Toby, Jul 21 '17 at 12:06
@Toby: Well if you didn't consider non-ASCII characters, in what way could `TryParse` actually fail? — Jon Skeet, Jul 21 '17 at 13:11
@JonSkeet I have a question but I don't know how to tag you there. Can you look at it [here](https://stackoverflow.com/questions/45181148/how-to-open-default-email-client-with-attachment) — Rich, Jul 21 '17 at 13:22
@JoshuaAlzate: Please don't use comments on one question/answer to talk about an *entirely unrelated* post. — Jon Skeet, Jul 21 '17 at 13:28
@JonSkeet TBH I don't know. I did try looking up `GetBytes` in the reference source before asking to see if I could understand how it works and adapt parts for my needs, but the actual implementation is not there as far as I understood it. I didn't really get as far as considering *what* might fail as (as mentioned) my understanding of UTF is lacking (I'm an embedded C FW guy, character sets are not something we use much) so I wasn't sure how the conversion would happen, nor what errors may arise during such. At the least I probably should have read the documentation more closely. RTFM I guess — Toby, Jul 21 '17 at 14:55

Kevin Li · Answer 3 · 2017-07-21T11:58:28.050

1

The GetBytes method is throwing an exception because your Encoding.EncoderFallback specifies that it should throw an exception.

Create an encoding object with EncoderReplacementFallback to avoid exceptions on unencodable characters.

Encoding encodingWithFallback = Encoding.GetEncoding("us-ascii", new EncoderReplacementFallback("?"), new DecoderReplacementFallback("?"));
encodingWithFallback.GetBytes("Hɘ££o wor£d!");

This approach imitates the TryParse methods of the primitive .NET value types:

bool TryEncodingToASCII(string s, out byte[] result)
{
    if (s == null || Regex.IsMatch(s, @"[^\x00-\x7F]")) // If a single non-ASCII character is found, return false.
    {
        result = null;
        return false;
    }
    result = Encoding.ASCII.GetBytes(s); // Convert the string to ASCII bytes.
    return true;
}

edited Jul 21 '17 at 11:58 · answered Jul 21 '17 at 11:28 by Kevin Li

Hmm, this would indeed prevent the method throwing the fallback exception, but it would not stop in case of error and my method would not be aware that anything had gone awry, right? – Toby Jul 21 '17 at 11:34
That’s right. Without an exception, the `Encoding` class gives you no way of knowing. In that case, I would create a regular expression with a pattern that matches all non-ASCII characters, and use `Regex.Replace(string, MatchEvaluator)` to handle each case with custom code. This way, you can definitely avoid exceptions for unknown characters. – Kevin Li Jul 21 '17 at 11:40
@KevinLi: You are wrong about the behaviour of the ASCII encoder. It uses a replacement fallback strategy rather than an exception. Links to relevant docs in my answer. – Chris Jul 21 '17 at 11:53
@Chris: You’re right. The object returned by the `Encoding.ASCII` property does not feature this behavior. – Kevin Li Jul 21 '17 at 11:56
