Why is StreamReader and sr.BaseStream.Seek() giving Junk Characters even in UTF8 Encoding

Question

The contents of abc.txt are:

ABCDEFGHIJ•XYZ

The character is shown correctly if I use this code (i.e. seek to position 9):

            string filePath = "D:\\abc.txt";
            FileStream fs = new FileStream(filePath, FileMode.Open);
            StreamReader sr = new StreamReader(fs, new UTF8Encoding(true), true);
            sr.BaseStream.Seek(9, SeekOrigin.Begin);
            char[] oneChar = new char[1];
            char ch = (char)sr.Read(oneChar, 0, 1);
            MessageBox.Show(oneChar[0].ToString());

But if I seek to the position just after that special dot character, I get a junk character.

That is, I get a junk character if I seek to position 11 (i.e. just after the dot):

sr.BaseStream.Seek(11, SeekOrigin.Begin);

This should give 'X', because the character at the 11th position is X.

I think the file contents are valid UTF-8.

One more thing: the length of the StreamReader's BaseStream and the length of the StreamReader's contents are different.

   MessageBox.Show(sr.BaseStream.Length.ToString());
   MessageBox.Show(sr.ReadToEnd().Length.ToString());

If you're using `StreamReader`, you should **never** seek the underlying stream: the reader assumes that *it* now controls the stream, because it maintains an internal buffer, so seeking the underlying stream can cause very odd effects. Also, `Seek(9...)` does not do what you expect: you can't seek 9 *characters* by seeking in *bytes* unless you're using a single-byte encoding, which you aren't. You need to ask the `StreamReader` to discard 9 characters instead. – Marc Gravell, Feb 13 '20 at 08:11
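
The suggestion in this comment, discarding characters through the reader rather than seeking the stream, could look roughly like the sketch below. The class name `ReaderSkip` and the method `SkipChars` are hypothetical helpers made up for illustration; they are not part of the .NET API.

```csharp
using System;
using System.IO;

public static class ReaderSkip
{
    // Hypothetical helper: reads and discards up to 'count' characters
    // from the reader. Returns the number actually discarded, which is
    // smaller than 'count' only if the end of the text was reached first.
    public static int SkipChars(TextReader reader, int count)
    {
        char[] buffer = new char[256];
        int skipped = 0;
        while (skipped < count)
        {
            int wanted = Math.Min(buffer.Length, count - skipped);
            int read = reader.Read(buffer, 0, wanted);
            if (read == 0)
            {
                break; // end of text
            }
            skipped += read;
        }
        return skipped;
    }
}
```

For the text in the question, calling `ReaderSkip.SkipChars(sr, 11)` on a freshly opened `StreamReader` and then `sr.Read()` should yield 'X', since the reader decodes the 3-byte • as a single character. This is sequential, so it costs time proportional to the number of characters skipped.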

Sweeper · Accepted Answer · 2020-02-13T07:56:47.330


> Why is StreamReader and sr.BaseStream.Seek() giving Junk Characters even in UTF8 Encoding

It is exactly because of UTF-8 that sr.BaseStream is giving junk characters. :)

StreamReader is the "smarter" of the two: it sits on top of a stream and understands text encodings, so it can decode bytes into characters. FileStream (i.e. sr.BaseStream) doesn't; it only knows about bytes.

Since your file is encoded in UTF-8 (a variable-length encoding), letters like A, B and C are encoded with 1 byte, but the • character needs 3 bytes. You can get how many bytes a character needs by doing:

Console.WriteLine(Encoding.UTF8.GetByteCount("•")); // prints 3

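So when you move the stream to "the position just after •" with Seek(11), you haven't actually moved past the •. The letters A to J occupy bytes 0 to 9 and the • occupies bytes 10 to 12, so byte offset 11 is the second byte of the •, which cannot be decoded as the start of a character.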
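
To make the byte layout concrete, here is a minimal sketch that maps a character index to a UTF-8 byte offset. The class `Utf8Offsets` and both method names are hypothetical, and the sketch assumes the index does not fall in the middle of a surrogate pair and that the data has no BOM in front of it.

```csharp
using System;
using System.Text;

public static class Utf8Offsets
{
    // Hypothetical helper: byte offset at which the character with the
    // given index starts when 'text' is encoded as UTF-8 (no BOM).
    // Assumes charIndex does not split a surrogate pair.
    public static int ByteOffsetOfChar(string text, int charIndex)
    {
        return Encoding.UTF8.GetByteCount(text.Substring(0, charIndex));
    }

    // Hex dump of the UTF-8 bytes, handy for seeing where each character sits.
    public static string HexDump(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

For "ABCDEFGHIJ•XYZ", `HexDump` gives `41-42-43-44-45-46-47-48-49-4A-E2-80-A2-58-59-5A`, where `E2-80-A2` is the •, and `ByteOffsetOfChar(text, 11)` gives 13, the byte at which X actually starts.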

The reason the two lengths differ is similar: sr.ReadToEnd().Length gives you the number of characters, whereas sr.BaseStream.Length gives you the number of bytes.
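
The difference can be reproduced without a file. The sketch below uses a `MemoryStream` as a stand-in for the question's `FileStream` and assumes UTF-8 without a BOM; `LengthDemo` and `Measure` are hypothetical names.

```csharp
using System.IO;
using System.Text;

public static class LengthDemo
{
    // Returns the byte length of the stream and the number of chars
    // decoded from it, for 'text' encoded as UTF-8 without a BOM.
    public static (long ByteLength, int CharLength) Measure(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text); // GetBytes emits no BOM
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            long byteLength = reader.BaseStream.Length;  // counts bytes
            int charLength = reader.ReadToEnd().Length;  // counts chars
            return (byteLength, charLength);
        }
    }
}
```

For "ABCDEFGHIJ•XYZ" this returns 16 bytes and 14 characters: 13 one-byte letters plus the 3-byte •. A file saved with a BOM would be 3 bytes longer still.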

edited Feb 13 '20 at 07:56

answered Feb 13 '20 at 07:28

Sweeper


So is it better to use some other encoding (like ASCII) instead of UTF8, so that all characters (even special ones) are treated as a single byte? I want a smooth one-character-to-one-character seek, and it would be almost impossible to account for the byte count of each character in a text file. But I prefer using UTF8, as I will also be encrypting the text file. – BeeGees Feb 13 '20 at 07:36
@BeeGees You don't have a wide range of characters that you can encode if you use ASCII though. UTF-32, which is a fixed-width encoding, is another option. – Sweeper Feb 13 '20 at 07:46
@BeeGees Have a look at [this extension method](https://stackoverflow.com/a/45748714/5133585). You can probably use this to find the location to seek to (albeit not fast)? – Sweeper Feb 13 '20 at 07:53
I tried UTF32. It does not give the normal A, B, etc. characters. This is because by default a text file without a BOM is considered UTF8 – BeeGees Feb 13 '20 at 07:57
@BeeGees With UTF-32, you have to read 4 bytes at a time, and use `Encoding.UTF32.GetChars` to get the characters. You can't just cast the result from `Read` to a `char`. – Sweeper Feb 13 '20 at 07:59
@BeeGees How is a BOM relevant here? Can't you just _set the encoding of your text file_ directly? – Sweeper Feb 13 '20 at 08:10
When I open the file in Notepad++ and check the encoding there, it reports UTF8 (without BOM), which is right, because most text files are UTF8 – BeeGees Feb 13 '20 at 08:15
Couldn't the UTF8-encoded data be converted to UTF32 in memory and then, once we finish handling it (seeking, inserting text, etc.), be converted back to UTF8 and written to the FileStream, which is a UTF8 file? – BeeGees Feb 13 '20 at 08:21
I feel like this discussion is a bit beyond the scope of this question. You are now asking “how do I seek to a certain character quickly in a UTF8 file”. I suggest you ask a new question. Make sure you explain how it is different from [this question](https://stackoverflow.com/questions/5404267/streamreader-and-seeking). I might try to answer it if I have time. @BeeGees – Sweeper Feb 13 '20 at 08:37
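
For the fixed-width route raised in the comments above, a rough sketch of seeking by index in UTF-32 data might look like this. It assumes little-endian UTF-32 (what `Encoding.UTF32` uses) with no BOM at the start of the stream; `Utf32Seek` and `ReadCodePointAt` are hypothetical names.

```csharp
using System.IO;
using System.Text;

public static class Utf32Seek
{
    // Hypothetical helper: returns the code point at the given index as a
    // string. Assumes the stream holds little-endian UTF-32 without a BOM,
    // so code point N always starts at byte offset N * 4.
    public static string ReadCodePointAt(Stream utf32Stream, long index)
    {
        utf32Stream.Seek(index * 4, SeekOrigin.Begin);

        byte[] buffer = new byte[4];
        int total = 0;
        while (total < buffer.Length)
        {
            int n = utf32Stream.Read(buffer, total, buffer.Length - total);
            if (n == 0)
            {
                throw new EndOfStreamException("Index is past the end of the data.");
            }
            total += n;
        }

        // Decode exactly one 4-byte code unit.
        return Encoding.UTF32.GetString(buffer);
    }
}
```

With a `MemoryStream` over `Encoding.UTF32.GetBytes("ABCDEFGHIJ•XYZ")`, index 10 yields "•" and index 11 yields "X". The trade-off is size: every character takes 4 bytes, and an existing UTF-8 file has to be re-encoded before this works.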
