3

Here is how to output every character of a single-byte character set, printable or not. The output file will contain Japanese characters such as チホヤツセ.

Encoding enc = Encoding.GetEncoding("shift_jis");
byte[] m_bytes = new byte[1];
StreamWriter sw = new StreamWriter(@"C:\shift_jis.txt");

for (int i = 0; i < 256; i++)
{
    m_bytes.SetValue((byte)i, 0);
    String Output = enc.GetString(m_bytes);
    sw.WriteLine(Output);
}

sw.Close();
sw.Dispose();

Here is my attempt to do the same with a double-byte character set.

Encoding enc = Encoding.GetEncoding("iso-2022-jp");
byte[] m_bytes = new byte[2];
StreamWriter sw = new StreamWriter(@"C:\iso-2022-jp.txt");

for (int i = 0; i < 256; i++)
{
    m_bytes.SetValue((byte)i, 0);

    for (int j = 0; j < 256; j++)
    {
        m_bytes.SetValue((byte)j, 1);
        String Output = null;
        Output = enc.GetString(m_bytes);
        sw.WriteLine(Output);
    }
}

sw.Close();
sw.Dispose();

The problem is that the output file still contains only the first 255 characters. Each byte is evaluated separately and returns the character for that byte alone, so the output string always contains two characters instead of one. Since characters in this character set are represented with two bytes, you have to specify them with two bytes, right?

So how do you iterate through and print all characters from a double-byte character set?

weston
Jake

3 Answers

1

If it is OK to have them in Unicode order, you could:

Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
char[] chars = new char[1];
byte[] bytes = new byte[16];

using (StreamWriter sw = new StreamWriter(@"C:\temp\iso-2022-jp.txt"))
{
    for (int i = 0; i <= char.MaxValue; i++)
    {
        chars[0] = (char)i;
        int count = enc.GetBytes(chars, 0, 1, bytes, 0);

        if (count != 0)
        {
            sw.WriteLine(chars[0]);
        }
    }
}

If you want to order it by byte sequence, you could:

Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
char[] chars = new char[1];
byte[] bytes = new byte[16];

var lst = new List<Tuple<byte[], char>>();

for (int i = 0; i <= char.MaxValue; i++)
{
    chars[0] = (char)i;
    int count = enc.GetBytes(chars, 0, 1, bytes, 0);

    if (count != 0)
    {
        var bytes2 = new byte[count];
        Array.Copy(bytes, bytes2, count);
        lst.Add(Tuple.Create(bytes2, chars[0]));
    }
}

lst.Sort((x, y) =>
{
    int min = Math.Min(x.Item1.Length, y.Item1.Length);

    for (int i = 0; i < min; i++)
    {
        int cmp = x.Item1[i].CompareTo(y.Item1[i]);

        if (cmp != 0)
        {
            return cmp;
        }
    }

    return x.Item1.Length.CompareTo(y.Item1.Length);
});

using (StreamWriter sw = new StreamWriter(@"C:\temp\iso-2022-jp.txt"))
{
    foreach (var tuple in lst)
    {
        sw.WriteLine(tuple.Item2);

        // This will print the full byte sequence necessary to 
        // generate the char. Note that iso-2022-jp uses escape
        // sequences to "activate" subtables and to deactivate them.
        //sw.WriteLine("{0}: {1}", tuple.Item2, string.Join(",", tuple.Item1.Select(x => x.ToString("x2"))));
    }
}

or with a different sorting order (length first):

lst.Sort((x, y) =>
{
    int cmp2 = x.Item1.Length.CompareTo(y.Item1.Length);

    if (cmp2 != 0)
    {
        return cmp2;
    }

    int min = Math.Min(x.Item1.Length, y.Item1.Length);

    for (int i = 0; i < min; i++)
    {
        int cmp = x.Item1[i].CompareTo(y.Item1[i]);

        if (cmp != 0)
        {
            return cmp;
        }
    }

    return 0;
});

Note that all of these examples generate only chars in the Basic Multilingual Plane (BMP). I don't think characters outside the BMP are included in this encoding... If necessary, I can modify the code to support them.

Just out of curiosity, here is the first version of the code extended to handle non-BMP characters (which aren't present in iso-2022-jp):

Encoding enc = (Encoding)Encoding.GetEncoding("iso-2022-jp").Clone();
enc.EncoderFallback = new EncoderReplacementFallback("");
byte[] bytes = new byte[16];

using (StreamWriter sw = new StreamWriter(@"C:\temp\iso-2022-jp.txt"))
{
    int max = -1;
    for (int i = 0; i <= 0x10FFFF; i++)
    {
        if (i >= 0xD800 && i <= 0xDFFF)
        {
            continue;
        }

        string chars = char.ConvertFromUtf32(i);

        int count = enc.GetBytes(chars, 0, chars.Length, bytes, 0);

        if (count != 0)
        {
            sw.WriteLine(chars);
            max = i;
        }
    }

    Console.WriteLine("maximum codepoint: {0}", max);
}
xanatos
  • This is perfect and works well. I sure wish we could all just use Unicode, but in the Medical industry if you want to sell products you have to support legacy systems. – Jake Aug 12 '15 at 16:08
1

You should use a writer configured with your encoding:

Encoding encoding = Encoding.GetEncoding("iso-2022-jp");
using (var stream = new FileStream(@"C:\iso-2022-jp.txt", FileMode.Create))
{
    using (StreamWriter writer = new StreamWriter(stream, encoding))
    {
        for (int i = 0; i <= char.MaxValue; i++)
        {
            // Each char goes on a separate line. ASCII chars take a single byte;
            // the others take more because of the leading escape sequence:
            writer.WriteLine(((char) i).ToString());
        }
    }
}
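To see the escape sequences mentioned in that comment, one option is to encode a single character and dump the resulting bytes as hex. This is only a minimal sketch: the names `EscapeSequenceDump` and `ToHex` are made up for illustration, and the `RegisterProvider` call assumes .NET Core / .NET 5+, where legacy code pages are not available by default (on .NET Framework that line can be dropped).

```csharp
using System;
using System.Text;

public static class EscapeSequenceDump
{
    // Hypothetical helper: returns the iso-2022-jp bytes of `text`
    // as hyphen-separated hex pairs.
    public static string ToHex(string text)
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages must be
        // registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding("iso-2022-jp");
        return BitConverter.ToString(enc.GetBytes(text));
    }

    public static void Main()
    {
        // Plain ASCII needs no escape sequence.
        Console.WriteLine(ToHex("A"));

        // A hiragana letter should start with 1B-24-42 (ESC $ B),
        // the sequence that switches to JIS X 0208.
        Console.WriteLine(ToHex("あ"));
    }
}
```

Comparing the two lines of output shows how much of each encoded character is escape-sequence overhead rather than the character code itself.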
g.pickardou
  • This is very useful to me as well. I definitely have a use for listing them all in ASCII with escape sequences, as well as having the output in Unicode. I never expected such a quick and thorough response. – Jake Aug 12 '15 at 16:12
0

This is an issue with the specific encoding you chose.

ISO-2022 encodings cannot simply be listed value by value in isolation; this is not Unicode. What a specific sequence of bytes means is determined by escape sequences in the byte stream.

From the Wikipedia article (ISO/IEC 2022):

To represent multiple character sets, the ISO/IEC 2022 character encodings include escape sequences which indicate the character set for characters which follow.
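As a rough illustration of this point, a two-byte JIS X 0208 code decodes to a single character only when it is preceded by the escape sequence that selects that character set. The sketch below wraps a byte pair in ESC $ B (switch to JIS X 0208) and ESC ( B (switch back to ASCII) before decoding. `Iso2022JpDemo` and `DecodePair` are hypothetical names, and the `RegisterProvider` call assumes .NET Core / .NET 5+ (it can be dropped on .NET Framework).

```csharp
using System;
using System.Text;

public static class Iso2022JpDemo
{
    // Hypothetical helper: decodes one two-byte JIS X 0208 code by
    // surrounding it with the escape sequences iso-2022-jp expects.
    public static string DecodePair(byte b1, byte b2)
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages must be
        // registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding enc = Encoding.GetEncoding("iso-2022-jp");
        byte[] bytes =
        {
            0x1B, 0x24, 0x42, // ESC $ B : switch to JIS X 0208
            b1, b2,           // the two-byte character code
            0x1B, 0x28, 0x42  // ESC ( B : switch back to ASCII
        };
        return enc.GetString(bytes);
    }

    public static void Main()
    {
        // JIS code 0x3021, which is 亜 in JIS X 0208.
        Console.WriteLine(DecodePair(0x30, 0x21));
    }
}
```

Without the surrounding escape sequences the decoder stays in ASCII mode, so the same two bytes come back as the two ASCII characters `0!`. That matches the question's observation that the output string always contains two characters.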

nepdev