Strange behavior creating multiple StreamReader on the same Stream

Question

I'm using a finite-state machine to read an extremely large file. It's not multi-threaded, so thread safety is not a concern.

It contains 3 kinds of content:

A binary number that gives the length of the following string, counting each character as 1
ANSI text, which takes 1 to 2 bytes per character
UTF-8 text, which takes 1 to 4 bytes per character

I've found this question that might be useful, but it didn't work for me. The similar Python question isn't useful either, because it won't throw any error. I have to read the content with the proper encoding, or the behavior becomes unpredictable.

Currently, I'm using StreamReader, but the CurrentEncoding property cannot be changed once the StreamReader is initialized.

So I've also tried recreating the StreamReader on the same Stream:

reader = new StreamReader(stream, encoding65001); //UTF-8
DoSomething(reader);
reader = new StreamReader(stream, encoding1252); //ANSI
DoSomething(reader);
reader = new StreamReader(stream, encoding936); //ANSI

//...

But it starts reading strange content from an unknown position. I haven't found the cause of this behavior.

Have I made a mistake in creating multiple StreamReaders, or is StreamReader designed so that you shouldn't create more than one on the same stream?

If it is designed that way, is there any solution for reading such a file?

Thank you for taking the time to read this.

Edit: I've run the following code on .NET Core 3.1:

Stream stream = File.OpenRead(testFilePath);
Console.WriteLine(stream.Position);
Console.WriteLine(stream.ReadByte());
Console.WriteLine(stream.Position + "\r\n");

StreamReader reader = new StreamReader(stream, Encoding.UTF8);
Console.WriteLine(reader.Read());
Console.WriteLine(stream.Position + "\r\n");

reader = new StreamReader(stream, CodePagesEncodingProvider.Instance.GetEncoding(1252));
Console.WriteLine(reader.Read());
Console.WriteLine(stream.Position);

With the example text of following:

abcdefg

And the output:

It's strange and interesting.

If you print out `stream.Position` after each `DoSomething` are the values 'correct'? E.g. after reading a single UTF-8 byte, the position should be 1 (or perhaps 2 depending on the byte). — Neil, Nov 13 '20 at 15:33
@Neil No. After the first re-creation the position jumps about 70 bytes ahead, and after the second it is more than 1000 bytes ahead. The values are the same across two runs, though. I'll try to work out how those numbers are related. — apflu, Nov 13 '20 at 15:40

score 1 · Accepted Answer · answered Nov 13 '20 at 16:03

The stream readers buffer the content of the underlying stream they're reading, which is what's causing your problems. Reading one character from your reader doesn't mean it reads just one character's worth of bytes from the underlying stream. It fills a whole buffer with bytes, and then yields you one character from that buffer.
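As a small illustration of that buffering, here is a sketch that reads a single character through a StreamReader and reports where the underlying stream ended up. The class and method names (`BufferingDemo`, `PositionAfterOneChar`) are made up for this example, and it uses a MemoryStream rather than a file so it is self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferingDemo
{
    // Hypothetical helper: reads ONE character through a StreamReader and
    // returns the position of the underlying stream afterwards.
    public static long PositionAfterOneChar(byte[] data)
    {
        using var stream = new MemoryStream(data);
        using var reader = new StreamReader(stream, Encoding.UTF8);

        reader.Read(); // asks for a single character...

        // ...but the reader has already pulled a whole buffer of bytes
        // (or everything available, if the stream is shorter than the buffer).
        return stream.Position;
    }
}
```

For the 7-byte input `abcdefg`, this should report a position of 7 rather than 1, because the reader consumed the entire stream to fill its buffer. A second StreamReader created on the same stream would then start from that position, not from the byte after the first character.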

If you want to read values from a stream and interpret different sections of bytes with different encodings, you'll have to pull the bytes out of the stream yourself and then convert them using the appropriate encoding. That way you can be sure you pull only the exact sections of bytes you want and no more. (For the record, if at all possible you should avoid putting yourself in the position of having mixed encodings in your data.)
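One way to do that is sketched below. It is not the only option. Because the question says the length prefix counts characters rather than bytes, this sketch feeds the stream to a `Decoder` one byte at a time and stops as soon as enough characters have been produced, so the stream is never read past the end of the section. The names `MixedEncodingReader` and `ReadChars` are hypothetical, "character" is assumed to mean a UTF-16 code unit (a C# `char`), and reading the binary length prefix is left out because its layout isn't given.

```csharp
using System;
using System.IO;
using System.Text;

public static class MixedEncodingReader
{
    // Hypothetical helper: reads bytes directly from the stream until
    // `charCount` chars have been decoded with `encoding`, then stops.
    // No read-ahead happens, so stream.Position ends right after the section.
    public static string ReadChars(Stream stream, Encoding encoding, int charCount)
    {
        Decoder decoder = encoding.GetDecoder(); // keeps state across partial multi-byte sequences
        var result = new StringBuilder(charCount);
        var oneByte = new byte[1];
        // Worst-case number of chars a single input byte can produce.
        var chars = new char[encoding.GetMaxCharCount(1)];

        while (result.Length < charCount)
        {
            int b = stream.ReadByte();
            if (b < 0)
                throw new EndOfStreamException("Stream ended before the section was complete.");

            oneByte[0] = (byte)b;
            // flush: false, so an incomplete sequence is held until the next byte arrives.
            int produced = decoder.GetChars(oneByte, 0, 1, chars, 0, false);
            result.Append(chars, 0, produced);
        }

        return result.ToString();
    }
}
```

Usage would look like `ReadChars(stream, Encoding.UTF8, n)` followed by `ReadChars(stream, CodePagesEncodingProvider.Instance.GetEncoding(1252), m)` on the same stream. Reading byte by byte is slow on an unbuffered stream, so for a very large file it is probably worth keeping the FileStream's own buffer or wrapping it in a BufferedStream. That is safe here because those buffers are accounted for in `Position`, unlike the StreamReader's private one.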

Thank you! That explains why the stream's position jumped ahead. If the stream is small enough, the StreamReader reads all of its contents and leaves the position at the end of the stream. By the time I recreate the reader, the Stream's position has already moved a long way. This is the answer we need! — apflu, Nov 13 '20 at 16:10
