This question may reveal my ignorance regarding character encoding, so if it does, I would greatly appreciate information to correct that.

I am relaying strings from new applications to an old application. The old application only accepts ASCII characters (http://www.asciitable.com/). The old application also does not support certain characters such as backslashes. The new applications support more or less anything.

Let's say I have the string:

"Whatever - 1_夜_💦💦💦"

I need to convert that to something with only ASCII characters. For example, maybe something like:

"Whatever - 1_\u001cY_=???=???=???"

Then I want to replace the remaining illegal characters with substitution strings.

Ideally, any character that is encoded to ASCII should be able to be de-coded. That is, every distinct input string should produce a distinct output string (two different inputs, such as "abc" and "xyz", must never produce the same result), so that an algorithm can convert the output string back to the input string.

This is what I've tried:

    static string ConvertToAscii(string str)
    {
        var return_string = "";

        foreach (var c in str)
        {
            if ((int)c < 128)
            {
                return_string += c;
            }
            else
            {
                var charBytes = BitConverter.GetBytes(c);
                var ascii = Encoding.ASCII.GetString(charBytes);
                return_string += ascii;
            }
        }

        return return_string;
    }

When I use this with the string I mentioned above, I get:

"Whatever - 1_\u001cY_=???=???=???"

That seems great - however, the "\u001cY" is apparently a single character, rather than a collection of ASCII characters. So my target database rejects it, and I am not able to figure out how to remove the "\" while leaving the remaining characters.

How can I convert any string into a collection of ASCII characters?

MattHH
  • Have you seen this [topic](https://stackoverflow.com/questions/4352209/conversion-from-utf8-to-ascii)? – Alexander I. Dec 22 '17 at 19:53
  • "any character that is encoded to ASCII should be able to be de-coded" - a sample showing how you want to represent characters outside the 0-127 (ASCII) range would help a lot for someone to come up with an answer. – Alexei Levenkov Dec 22 '17 at 19:59
  • When you say "So my target database rejects it": are you sure the old software works with non-printable ASCII characters? The first 32 characters in ASCII are non-printable control characters, which is why you are seeing an escaped representation of it. – Raudel Ravelo Dec 22 '17 at 20:07
  • My bad, now I realized the problem is with the returned representation that comes with the backslash added to it and that is probably why you are getting the error. That one \u001c is number 28 in the ASCII table. – Raudel Ravelo Dec 22 '17 at 20:10
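
To illustrate the point made in these comments, here is a minimal sketch of what the `else` branch in the question does to a single character. The class and method names are hypothetical, and the byte order shown assumes a little-endian machine. The backslash in "\u001cY" is only how a debugger or string literal displays the control character U+001C; the string itself contains two characters and no backslash.

```csharp
using System;
using System.Text;

static class WhyControlCharacter
{
    // Hypothetical helper: decodes one char's raw UTF-16 bytes as ASCII,
    // mirroring the else-branch of the question's ConvertToAscii.
    public static string DecodeBytesAsAscii(char c) =>
        Encoding.ASCII.GetString(BitConverter.GetBytes(c));
}

// Usage (assuming a little-endian machine):
//   var s = WhyControlCharacter.DecodeBytesAsAscii('夜'); // '夜' is U+591C
//   s.Length   -> 2
//   (int)s[0]  -> 28  (0x1C, the control character displayed as \u001c)
//   s[1]       -> 'Y' (0x59)
```

Bytes of 0x80 or above are not valid ASCII, so `Encoding.ASCII` turns them into '?', which is why other characters can end up as question marks and cannot be recovered.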

2 Answers

The easiest approach is to Base64-encode all the bytes, since you don't seem to care how the strings are represented:

    Convert.ToBase64String(Encoding.Unicode.GetBytes("Whatever - 1_夜_💦💦💦"))

will produce a result that is guaranteed to be ASCII (printable ASCII, in fact). For your string, the result is "VwBoAGEAdABlAHYAZQByACAALQAgADEAXwAcWV8APdim3D3Yptw92Kbc".
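
Since the question also asks for decoding, here is a minimal round-trip sketch. The class and method names are hypothetical. It assumes the old application accepts the Base64 alphabet (A-Z, a-z, 0-9, '+', '/' and '='); if any of those are rejected, they would need a further substitution.

```csharp
using System;
using System.Text;

static class Base64RoundTrip
{
    // UTF-16LE bytes of the string, written out as Base64 text.
    public static string Encode(string s) =>
        Convert.ToBase64String(Encoding.Unicode.GetBytes(s));

    // Reverses Encode: Base64 text back to bytes, then bytes back to a string.
    public static string Decode(string ascii) =>
        Encoding.Unicode.GetString(Convert.FromBase64String(ascii));
}

// Usage:
//   var original = "Whatever - 1_夜_";
//   var encoded  = Base64RoundTrip.Encode(original);
//   var decoded  = Base64RoundTrip.Decode(encoded);
//   decoded == original  -> true
```

This should round-trip any well-formed string. The output is roughly 2.7 times the length of the input in characters, and it is not human-readable, which may or may not matter for the old application.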

Alexei Levenkov

Here is code similar to what I ended up using to convert everything to ASCII:

// Requires: using System.Text;
internal static string ConvertToAscii(string str)
{
    var returnStringBuilder = new StringBuilder();

    foreach (var c in str)
    {
        if (char.IsControl(c))
        {
            // Control characters (U+0000-U+001F, U+007F-U+009F) are dropped.
            continue;
        }
        if (c < 127)
        {
            // Printable ASCII character: copy as-is.
            returnStringBuilder.Append(c);
        }
        else
        {
            // Non-ASCII UTF-16 code unit: write it as "U+" plus four hex digits.
            returnStringBuilder.Append("U+" + ((int) c).ToString("X4"));
        }
    }

    return returnStringBuilder.ToString();
}
MattHH
  • This does not meet your requirement for unique decoding. Here are two inputs that have the same output: "Unicode Character 'EURO SIGN' (U+20AC)" and "Unicode Character 'EURO SIGN' (€)". – Tom Blodget Jun 26 '18 at 20:52
  • Good point, Tom. There does not seem to be any way to meet that requirement unless certain characters are reserved for indicating substitutions. – Matthew Hostetler Jun 27 '18 at 21:57
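
One way to reserve a character, as the last comment suggests, might look like the sketch below. It is not the code from the answer above: the class name, the choice of '%' as the escape marker, and the four-hex-digit format are all assumptions made for illustration. Because the marker itself is always escaped, no two different inputs can produce the same output, and the backslash is escaped as well since the old application rejects it.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AsciiEscaper
{
    // Hypothetical choice: '%' is reserved as the escape marker.
    const char Escape = '%';

    public static string Encode(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Printable ASCII passes through, except the marker and backslash.
            bool safe = c >= ' ' && c <= '~' && c != Escape && c != '\\';
            if (safe)
                sb.Append(c);
            else
                sb.Append(Escape).Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static string Decode(string encoded)
    {
        var sb = new StringBuilder(encoded.Length);
        for (int i = 0; i < encoded.Length; i++)
        {
            if (encoded[i] != Escape)
            {
                sb.Append(encoded[i]);
                continue;
            }
            // In output of Encode, the marker is always followed by four hex digits.
            string hex = encoded.Substring(i + 1, 4);
            sb.Append((char)int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture));
            i += 4; // skip the digits just consumed
        }
        return sb.ToString();
    }
}

// Usage:
//   AsciiEscaper.Encode("Whatever - 1_夜_")  -> "Whatever - 1_%591C_"
//   AsciiEscaper.Encode("(€)")               -> "(%20AC)"
//   AsciiEscaper.Encode("(%20AC)")           -> "(%002520AC)"
```

Each UTF-16 code unit is escaped on its own, so characters outside the Basic Multilingual Plane become two escapes (one per surrogate) and still decode exactly. `Decode` assumes its input came from `Encode`; malformed input would throw.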