3

I have a test program that demonstrates the end result that I am hoping for (even though in this test program the steps may seem unnecessary).

The program compresses data to a file using GZipStream. The resulting compressed file is C:\mydata.dat.

I then read this file, and write it to a new file.

//Read original file
string compressedFile = String.Empty;
using (StreamReader reader = new StreamReader(@"C:\mydata.dat"))
{
    compressedFile = reader.ReadToEnd();
    reader.Close();
    reader.Dispose();
}

//Write to a new file
using (StreamWriter file = new StreamWriter(@"C:\mynewdata.dat"))
{
    file.WriteLine(compressedFile);
}

When I try to decompress the two files, the original one decompresses perfectly, but the new file throws an InvalidDataException with the message "The magic number in GZip header is not correct. Make sure you are passing in a GZip stream."

Why are these files different?

jkh

2 Answers

3

StreamReader is for reading a sequence of characters, not bytes. The same applies to StreamWriter. Since treating a compressed file as a stream of characters doesn't make any sense, you should use some implementation of Stream instead. If you want to get the stream as an array of bytes, you can use MemoryStream.

The exact reason why using character streams doesn't work is that they assume UTF-8 encoding by default. If some byte is not valid UTF-8 (like the second byte of the GZip header, 0x8B), it is decoded as the Unicode “replacement character” (U+FFFD). When the string is written back, that character is encoded in UTF-8 as the three bytes 0xEF 0xBF 0xBD, which is completely different from what was in the source.
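To make the corruption concrete, here is a minimal sketch that decodes raw bytes as UTF-8 and encodes the resulting string back, which is roughly what a StreamReader followed by a StreamWriter does with their default encoding. The class and method names (`Utf8RoundTripDemo`, `RoundTrip`) are made up for illustration.

```csharp
using System;
using System.Text;

static class Utf8RoundTripDemo
{
    // Hypothetical helper: decodes the bytes as UTF-8 text, then encodes
    // that text back to bytes. Invalid input bytes do not survive the trip.
    public static byte[] RoundTrip(byte[] original)
    {
        // 0x8B on its own is not valid UTF-8, so it is decoded as U+FFFD.
        string text = Encoding.UTF8.GetString(original);

        // U+FFFD is encoded as the three bytes EF BF BD.
        return Encoding.UTF8.GetBytes(text);
    }
}
```

Passing the first three bytes of a GZip header, `new byte[] { 0x1F, 0x8B, 0x08 }`, and formatting the result with `BitConverter.ToString` gives `1F-EF-BF-BD-08`: the second magic-number byte is gone, which is why decompression fails.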

For example, to read a file from a stream, get it as an array of bytes and then write it to another file as a stream:

byte[] bytes;
using (var fileStream = new FileStream(@"C:\mydata.dat", FileMode.Open))
using (var memoryStream = new MemoryStream())
{
    fileStream.CopyTo(memoryStream);
    bytes = memoryStream.ToArray();
}

using (var memoryStream = new MemoryStream(bytes))
using (var fileStream = new FileStream(@"C:\mynewdata.dat", FileMode.Create))
{
    memoryStream.CopyTo(fileStream);
}

The CopyTo() method is only available in .NET 4 and later, but you can write your own if you use an older version.
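A hand-written replacement for older framework versions could look something like the sketch below. The names `StreamCopyHelper` and `CopyStream` and the 4096-byte buffer size are arbitrary choices for illustration, not part of the framework.

```csharp
using System.IO;

static class StreamCopyHelper
{
    // Hypothetical stand-in for Stream.CopyTo(): copies everything that
    // remains in source to destination, working on raw bytes only.
    public static void CopyStream(Stream source, Stream destination)
    {
        byte[] buffer = new byte[4096];
        int bytesRead;

        // Read() may return fewer bytes than requested; it returns 0 only
        // at the end of the stream, so loop until that happens.
        while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, bytesRead);
        }
    }
}
```

With this in place, `fileStream.CopyTo(memoryStream)` in the example above would become `StreamCopyHelper.CopyStream(fileStream, memoryStream)`.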

Of course, for this simple example, there is no need to use streams. You can simply do:

byte[] bytes = File.ReadAllBytes(@"C:\mydata.dat");
File.WriteAllBytes(@"C:\mynewdata.dat", bytes);
svick
-1

EDIT: Apparently, my suggestions are wrong/invalid/whatever... please use one of the others, which have no doubt been highly refactored to the point where no extra performance could possibly be achieved (else, that would mean they are just as invalid as mine).

using (System.IO.StreamReader sr = new System.IO.StreamReader(@"C:\mydata.dat"))
{
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(@"C:\mynewdata.dat"))
    {
        byte[] bytes = new byte[1024];
        int count = 0;
        while((count = sr.BaseStream.Read(bytes, 0, bytes.Length)) > 0){
            sw.BaseStream.Write(bytes, 0, count);
        }
    }
}

Read all bytes

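byte[] bytes = null;
using (System.IO.StreamReader sr = new System.IO.StreamReader(@"C:\mydata.dat"))
{
    bytes = new byte[sr.BaseStream.Length];
    int index = 0;
    int count = 0;
    // Never request more bytes than remain in the array, or Read() throws an ArgumentException.
    while((count = sr.BaseStream.Read(bytes, index, System.Math.Min(1024, bytes.Length - index))) > 0){
        index += count;
    }
}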

Read all bytes/write all bytes (from svick's answer):

byte[] bytes = File.ReadAllBytes(@"C:\mydata.dat");
File.WriteAllBytes(@"C:\mynewdata.dat", bytes);

PERFORMANCE TESTING WITH OTHER ANSWERS:

Just did a quick test between my answer (StreamReader; the first part above, the file copy) and svick's answer (FileStream/MemoryStream; his first example). The test is 1000 iterations of the code. Here are the results from 4 tests (results are in whole seconds; all actual results were slightly over these values):

My Code | svick code
--------------------
9       | 12
9       | 14
8       | 13
8       | 14

As you can see, in my test at least, my code performed better. One thing perhaps to note with mine is that I am not reading a character stream; I am in fact accessing the BaseStream, which provides a byte stream. Perhaps svick's answer is slow because he is using two streams for reading, then two for writing. Of course, there is a lot of optimisation that could be done to svick's answer to improve the performance (and he also provided an alternative for simple file copy).

Testing with third option (ReadAllBytes/WriteAllBytes)

My Code | svick code | 3rd
----------------------------
8       | 14         | 7
9       | 18         | 9
9       | 17         | 8
9       | 17         | 9

Note: in milliseconds the 3rd option was always better
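The harness used for these timings is not shown. As a rough sketch of how such a measurement could be taken, assuming `System.Diagnostics.Stopwatch` and a hypothetical helper named `TimeIterations`:

```csharp
using System;
using System.Diagnostics;

static class BenchmarkSketch
{
    // Hypothetical helper: runs the action the given number of times and
    // returns the elapsed wall-clock time in milliseconds.
    public static long TimeIterations(Action action, int iterations)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

Each approach would be wrapped in a lambda, for example `BenchmarkSketch.TimeIterations(() => File.WriteAllBytes(@"C:\mynewdata.dat", File.ReadAllBytes(@"C:\mydata.dat")), 1000)`. Numbers obtained this way depend heavily on disk caching and file size, so they are only a rough comparison.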

musefan
  • Thanks for the response...but what if I did not have access to both files at the same time, and I needed some sort of temporary storage...could I just read it into a byte[]? – jkh Jul 13 '11 at 15:23
  • `Read()` is not required to return all the bytes in the input stream, even if they fit in the array. For example, with `NetworkStream`, this happens regularly. But all other `Stream`s are allowed to do the same. If you want to use `Read()` this way, you have to make sure there are no more bytes to read. – svick Jul 13 '11 at 15:48
  • 2
    Why do you even create `StreamReader` when reading bytes? Just use `FileStream` directly. – svick Jul 13 '11 at 16:00
  • I went with what the OP had used, and to be honest, I couldn't remember the ins and outs of memory/file stream. I up'd your answer anyway.... but I think the down votes on mine are way unfair as I have provided a valid (albeit alternative) solution – musefan Jul 13 '11 at 16:13
  • Another valid (albeit alternative) solution would be the C# equivalent of `system("copy mydata.dat mynewdata.dat")`. Doesn't mean it shouldn't be downvoted :P If the OP's code is wrong, how is "going with what the OP used" a helpful answer? – anton.burger Jul 13 '11 at 16:17
  • @Anton: What makes the code wrong? Can you show that one is better (performance) than another? – musefan Jul 13 '11 at 16:20
  • 1
    svick already mentioned you should just be using `Stream` (bytes) rather than `StreamReader` (characters). Performance is absolutely irrelevant to the discussion; the point is that you admit to having posted a misleading code sample with no explanation, without taking the time to look up the details. – anton.burger Jul 13 '11 at 16:28
  • No, I said I was aware of FileStream and MemoryStream. I am in no way stating that my code is bad or wrong. Working code is valid code. And performance is EXACTLY the reason to determine if you should use one method or another – musefan Jul 13 '11 at 16:39
  • Your "being aware" of them doesn't help the OP in any way, unless it's actually in your answer. Your answer is misleading because you didn't do anything to address the OP's misconceptions about reading files with `Stream` vs. `StreamReader`. I'll remove _my_ downvote if you do. – anton.burger Jul 13 '11 at 16:46
  • @musefan, performance is one of the reasons to choose one way of writing a piece of code over another. But it's certainly not the only one, and, in most cases, also not the most important one. As Anton said, using character streams to read bytes, even in a correct way, is misleading and potentially confusing for someone who knows the difference. (He's using `StreamReader` instead of `FileStream`? Why? There has to be some hidden reason, right?) – svick Jul 13 '11 at 17:56
  • @svick: I don't know his reasons and maybe the OP does have one, but I can only work with what I have in the question. I think what seems to be the case here is that "alternate" solutions are wrong and there should only ever be one answer per question asked? (unless of course people just copy and paste other people's answers) I have taken on board what has been said, and in terms of performance I am not just going to agree with you without seeing for myself that your way is better (which I don't have time to test) - sometimes, it would seem one way has better performance, but it is not always so – musefan Jul 14 '11 at 11:26
  • @Anton: There you go, I have informed in my answer that it may not be the best approach based on other answers. And while we are on the point of informing the OP of bad practices, I don't see anyone saying that you don't need to put close/dispose in a using – musefan Jul 14 '11 at 11:30
  • Wow, good grace and not a hint of sarcasm, either :P I'm sorry that we rubbed each other up the wrong way, musefan; I should just have suggested edits to your answer instead of getting into an argument with you. – anton.burger Jul 14 '11 at 12:34
  • @Anton: I hold no grudges. I also did a performance test, see my edited answer – musefan Jul 14 '11 at 12:50