8

So I've got some trouble with character encoding. When I put the following two characters into a UTF-32 encoded text file:

U+29E15 and U+9E15 (鸕), one per line

and then run this code on them:

// The third argument (detectEncodingFromByteOrderMarks) is false, so the
// reader is told to use UTF-32 whatever the preamble says.
using (var streamReader =
    new System.IO.StreamReader("input", System.Text.Encoding.UTF32, false))
using (var streamWriter =
    new System.IO.StreamWriter("output", false, System.Text.Encoding.UTF32))
{
    // Copy the whole input to the output, re-encoding it as UTF-32.
    streamWriter.Write(streamReader.ReadToEnd());
}

I get:

鸕
鸕

(the same character twice, i.e. the input file != the output)

A few things that might help: Hex for the first character:

15 9E 02 00

And for the second:

15 9E 00 00

I'm using gedit to create the text file and Mono to run the C#, all on Ubuntu.

It also doesn't matter whether I specify the encoding for the input or the output file; it just doesn't like it if the input is in UTF-32 encoding. It works if the input file is in UTF-8 encoding.

The input file is as follows:

FF FE 00 00 15 9E 02 00 0A 00 00 00 15 9E 00 00 0A 00 00 00
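For reference, the dump above can be decoded by hand to see which UTF-16 code units a correct read should produce. This is only a sketch: the byte array is copied from the dump, the class name is made up for illustration, and it assumes `Encoding.UTF32` (little-endian UTF-32) matches the file.

```csharp
using System;
using System.Text;

class DecodeDumpSketch // hypothetical name, for illustration only
{
    static void Main()
    {
        // Bytes copied from the hex dump of the input file (BOM first).
        byte[] fileBytes =
        {
            0xFF, 0xFE, 0x00, 0x00,
            0x15, 0x9E, 0x02, 0x00,
            0x0A, 0x00, 0x00, 0x00,
            0x15, 0x9E, 0x00, 0x00,
            0x0A, 0x00, 0x00, 0x00
        };

        // Skip the BOM and decode the rest as little-endian UTF-32.
        int bomLength = Encoding.UTF32.GetPreamble().Length;
        string decoded = Encoding.UTF32.GetString(
            fileBytes, bomLength, fileBytes.Length - bomLength);

        // A .NET string is a sequence of UTF-16 code units; print each one.
        foreach (char unit in decoded)
        {
            Console.Write("{0:X4} ", (int)unit);
        }
        Console.WriteLine();
    }
}
```

A code point above U+FFFF shows up as two code units (a surrogate pair), so the decoded string has more `char` values than the file has characters.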

Is it a bug, or is it just me?

Thanks!

Andrew
AStupidNoob
  • Encoding of output file? – L.B Apr 03 '12 at 05:54
  • Print out the result of `streamReader.ReadToEnd()`. – leppie Apr 03 '12 at 05:56
  • @L.B - Changing it doesn't help – AStupidNoob Apr 03 '12 at 05:58
  • @leppie - It sure looks like the problem is in the reading: "鸕\n鸕" – AStupidNoob Apr 03 '12 at 05:59
  • What have you done by way of debugging? For instance, try putting the result of `streamReader.ReadToEnd()` into a string, and then check that. It should be the UTF-16 encoded version of the input. – Mr Lister Apr 03 '12 at 06:06
  • See 4th comment, that's exactly what I did. The problem is in the reading. If the file is saved in UTF8, and there is no encoding specified, the file is read and written correctly – AStupidNoob Apr 03 '12 at 06:09
  • How do you mean you get "鸕鸕"? Where are you reading this output? – Chibueze Opata Apr 03 '12 at 06:58
  • @Chibueze Opata - I'm reading it using the debugger, by assigning a variable to the value of streamReader.ReadToEnd(). – AStupidNoob Apr 03 '12 at 07:08
  • That means you're not reading the file correctly. The input is not in the UTF-32 encoding you specified; try detecting the encoding automatically instead. See my answer below. – Chibueze Opata Apr 03 '12 at 07:13
  • @AStupidNoob if you use a hex editor to look at the input file, what values does it contain? (Just the first 16 will do.) It could be that the file is UTF-32 (LE) after all, but the StreamReader constructor mistakes the first two bytes of the BOM for UTF-16 (LE). That would be a horrible bug. – Mr Lister Apr 03 '12 at 12:46
  • @Mr Lister I have edited the question with the input file and some new, clearer code that directly specifies that the input is in UTF32, overriding whatever the preamble says. I find it strange that gedit will open `input` and save it, no problems, but my small annoying code just won't... – AStupidNoob Apr 04 '12 at 03:18

5 Answers

6

OK, I think I've figured it out; it seems to work now. The first character is stored as 15 9E 02 00, i.e. code point U+29E15, which is above U+FFFF, so there's no way it can be held in one single UTF-16 char. (The second, 15 9E 00 00, is U+9E15 and does fit in one.) Instead, UTF-16 uses surrogate pairs, where two chars act together as one 'element'. To get the elements, we can use:

StringInfo.GetTextElementEnumerator(string fred);

This returns a TextElementEnumerator, and each call to its GetTextElement() method returns a string holding one whole element, surrogate pair included. Treat that string as one character.
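Here is a minimal sketch of walking a string element by element. The class name is made up for illustration, and the two code points are the ones from the question, built with `char.ConvertFromUtf32` so that no file is needed.

```csharp
using System;
using System.Globalization;

class TextElementSketch // hypothetical name, for illustration only
{
    static void Main()
    {
        // U+29E15 needs a surrogate pair in UTF-16; U+9E15 fits in one char.
        string text = char.ConvertFromUtf32(0x29E15) + "\n" +
                      char.ConvertFromUtf32(0x9E15);

        // Length counts UTF-16 code units, not characters.
        Console.WriteLine("Code units: " + text.Length);

        // Each text element comes back as a string, so a surrogate pair
        // stays together instead of being split into two chars.
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            int codePoint = char.ConvertToUtf32(element, 0);
            Console.WriteLine("U+{0:X4} uses {1} UTF-16 code unit(s)",
                              codePoint, element.Length);
        }
    }
}
```

Indexing the string with `text[i]` would instead hand back the two halves of the pair separately, which is easy to misread in a debugger.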

See here:

http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx

http://msdn.microsoft.com/en-us/library/system.globalization.textelementenumerator.gettextelement.aspx

Hope it helps someone :D

AStupidNoob
1

I tried this and it works well on my PC.

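// true = detect the encoding from the byte order mark (BOM).
System.IO.StreamReader streamReader = new System.IO.StreamReader("input", true);
// false = overwrite instead of appending; the encoding defaults to UTF-8.
System.IO.StreamWriter streamWriter = new System.IO.StreamWriter("output", false);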

streamWriter.Write(streamReader.ReadToEnd());

streamWriter.Close();
streamReader.Close();

Maybe the text you think is in UTF32 is not.

Chibueze Opata
  • Are you using Visual Studio/Windows? It might just be mono if not. I'll try other programs to make sure it is indeed UTF32, it certainly looks like it in a hex editor... – AStupidNoob Apr 03 '12 at 07:22
  • Ok, good luck. But your code produced a wrong output as well on my PC. – Chibueze Opata Apr 03 '12 at 07:25
  • Oh, sorry I didn't notice the change in your code. In other news, using visual studio 2012 beta resulted in the correct output with my code... – AStupidNoob Apr 03 '12 at 07:39
0

When writing, you're not specifying UTF-32, so the StreamWriter defaults to UTF-8.

From MSDN:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM), so its GetPreamble method returns an empty byte array. To create a StreamWriter using UTF-8 encoding and a BOM, consider using a constructor that specifies encoding, such as StreamWriter(String, Boolean, Encoding).
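The difference between the two constructors can be checked directly. This is a sketch only: the scratch file from `Path.GetTempFileName()` and the class name are assumptions made for illustration, not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;

class WriterEncodingSketch // hypothetical name, for illustration only
{
    static void Main()
    {
        string path = Path.GetTempFileName(); // scratch file, assumed writable

        // No encoding argument: the writer picks its default encoding.
        using (var defaultWriter = new StreamWriter(path, false))
        {
            Console.WriteLine("{0}, preamble bytes: {1}",
                defaultWriter.Encoding.WebName,
                defaultWriter.Encoding.GetPreamble().Length);
        }

        // Explicit encoding: the writer uses UTF-32 and its BOM.
        using (var utf32Writer = new StreamWriter(path, false, Encoding.UTF32))
        {
            Console.WriteLine("{0}, preamble bytes: {1}",
                utf32Writer.Encoding.WebName,
                utf32Writer.Encoding.GetPreamble().Length);
        }

        File.Delete(path);
    }
}
```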

Sani Huttunen
  • That doesn't seem to be the problem. I've updated the question to help remove any confusion. Thanks anyway though! – AStupidNoob Apr 03 '12 at 06:04
0

I think you need to specify the same encoding (Encoding.UTF32) also for your StreamWriter.

EDIT:

Normally it is not needed between UTF codepages but I would also try this:

Encoding utf8 = Encoding.UTF8;
Encoding utf32 = Encoding.UTF32;
byte[] utf8Bytes = utf8.GetBytes(yourText);
byte[] utf32Bytes = Encoding.Convert(utf8, utf32, utf8Bytes);
string utf32Text = utf32.GetString(utf32Bytes);
Andrew
Dummy01
  • I have :D, I just edited the question. Also it wouldn't really matter anyway, since any UTF-32 character can be expressed in UTF-8 or any Unicode encoding for that matter. AFAIK, anyway. – AStupidNoob Apr 03 '12 at 06:08
  • @AStupidNoob I just read your updated answer and your comments. If you know the encoding of the file you are reading and it is something other than UTF32, then you have to read it in its original encoding and convert it to the one you want before writing it. – Dummy01 Apr 03 '12 at 06:48
  • Thanks for your help again. I tried your suggestion, but I couldn't get it working D:. Also, I thought the entire purpose of StreamReaders and StreamWriters was to convert between encodings. Maybe not then. – AStupidNoob Apr 03 '12 at 07:20
0

From the Remarks section of MSDN for StreamReader's constructor:

This constructor initializes the encoding as specified by the encoding parameter, and the internal buffer size to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

Very likely the byte order marks at the beginning of your file are actually being read as indicating UTF-16 (or something else), and so the reader is not using your explicitly stated UTF-32 encoding.
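One way a UTF-32 file could be taken for UTF-16: the little-endian UTF-16 BOM is the first two bytes of the little-endian UTF-32 BOM, so a detector that only looks for the shorter mark would match both. The sketch below just compares the two preambles; it does not show what any particular StreamReader implementation does, and the class name is made up for illustration.

```csharp
using System;
using System.Text;

class BomPrefixSketch // hypothetical name, for illustration only
{
    static void Main()
    {
        byte[] utf16Bom = Encoding.Unicode.GetPreamble(); // UTF-16, little-endian
        byte[] utf32Bom = Encoding.UTF32.GetPreamble();   // UTF-32, little-endian

        Console.WriteLine("UTF-16 LE BOM: " + BitConverter.ToString(utf16Bom));
        Console.WriteLine("UTF-32 LE BOM: " + BitConverter.ToString(utf32Bom));

        // Check whether every byte of the UTF-16 BOM also starts the UTF-32 BOM.
        bool isPrefix = utf32Bom.Length >= utf16Bom.Length;
        for (int i = 0; isPrefix && i < utf16Bom.Length; i++)
        {
            isPrefix = utf32Bom[i] == utf16Bom[i];
        }
        Console.WriteLine("UTF-16 BOM is a prefix of the UTF-32 BOM: " + isPrefix);
    }
}
```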

Tanzelax
  • Sure why not, I'll try using some other programs to ensure I'm getting the correct BOM. – AStupidNoob Apr 03 '12 at 07:23
  • @AStupidNoob it looks like there's a constructor overload that will not look at the BOM by adding a boolean parameter, could try that if you don't have another program on hand to check. – Tanzelax Apr 03 '12 at 07:34
  • Right, I would have thought that specifying the encoding would have ensured it was used, obviously not then. I did, however, try using windows for this and it worked. But, I was not able to verify its UTF32 output since I don't have any windows programs that play well with UTF32, so I swapped it to output in UTF8. – AStupidNoob Apr 03 '12 at 07:42