3

I need to write the bytes of an IEnumerable<byte> to a file.
I can convert it to an array and use the Write(byte[], int, int) method:

var array = bytes.ToArray();
using (var stream = File.Create(path))
    stream.Write(array, 0, array.Length);

But since IEnumerable<byte> doesn't expose the collection's item count, using ToArray is not recommended unless it's absolutely necessary.

So I can instead iterate over the IEnumerable and call WriteByte(byte) in each iteration:

using (var stream = File.Create(path))
    foreach (var b in bytes)
        stream.WriteByte(b);

I wonder which one will be faster when writing lots of data.

I guess Write sets the buffer according to the array size, so it would be faster when it comes to arrays.

My question is: when I just have an IEnumerable<byte> that holds megabytes of data, which approach is better? Converting it to an array and calling Write, or iterating it and calling WriteByte for each byte?

Şafak Gür
  • Why don't you measure it for yourself? What kind of `IEnumerable` do you have? How likely is it that it also implements other collection interfaces (like `IList`)? – svick Sep 22 '12 at 16:35

2 Answers

3

Enumerating over a large stream of bytes adds tons of overhead to something that is normally cheap: copying bytes from one buffer to the next.

Normally, LINQ-style overhead does not matter much, but when it comes to processing 100 million bytes per second on a normal hard drive, you will notice severe overhead. This is not premature optimization: we can foresee that this will be a performance hotspot, so we should eagerly optimize.

So when copying bytes around, you probably should not rely on abstractions like IEnumerable and IList at all. Pass around arrays or ArraySegment<byte> values, which also carry an Offset and a Count. This frees you from slicing arrays too often.
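
As a rough sketch of what that could look like (the WriteSegment helper below is just an illustration, not an existing API):

using System;
using System.IO;

static void WriteSegment(Stream stream, ArraySegment<byte> segment)
{
    // The segment already carries the array plus Offset and Count,
    // so no sliced copy of the underlying array is ever allocated.
    stream.Write(segment.Array, segment.Offset, segment.Count);
}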

Another deadly sin with high-throughput IO is calling a method per byte, such as reading and writing bytewise. This kills performance because these methods have to be called hundreds of millions of times per second; I have experienced that myself.

Always process entire buffers of at least 4096 bytes at a time. Depending on what media you are doing IO with, you can use much larger buffers (64 KB, 256 KB or even megabytes).
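
Applied to the question, a sketch might look like this (`bytes` and `path` are the variables from the question; the 64 KB buffer size is only an example, not a measured optimum):

var buffer = new byte[64 * 1024]; // block size; tune it for your IO media
var count = 0;

using (var stream = File.Create(path))
{
    foreach (var b in bytes)
    {
        buffer[count++] = b;
        if (count == buffer.Length)
        {
            // Hand a full block to the stream instead of one byte at a time.
            stream.Write(buffer, 0, count);
            count = 0;
        }
    }

    if (count > 0) // write whatever is left in the last, partially filled block
        stream.Write(buffer, 0, count);
}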

usr
  • `FileStream` has an internal buffer of 4096 bytes by default. `WriteByte` writes to that buffer and only flushes it when full. Flushing also happens when you switch between reading and writing, when you call `Flush`, or when you close the stream. Iterating over the enumerator, on the other hand, needs to happen anyway. – Wormbo Sep 23 '12 at 07:02
  • @Wormbo you don't even want a single virtual function call per byte (and Write internally probably does a little more than just setting that byte). Even that turns out to be very expensive when writing to a disk at full sequential speed. All you want to do is block-copy, which is extremely fast. – usr Sep 23 '12 at 08:35
  • I wonder what "lots of data" really means. Are we talking about tens of thousands or even millions of bytes? I did some testing and around 19 to 20 million bytes, the `ToArray()` overhead starts outweighing the `WriteByte()` overhead on my machine. A fixed-size separate buffer outperforms both approaches almost all the time, though. – Wormbo Sep 23 '12 at 10:41
  • @Wormbo yes, calling ToArray before calling Write is useless because ToArray enumerates the sequence. The overhead just moved to a different place, but it is roughly the same amount. Once you have a sequence, all is lost. One must stay with byte arrays all the way. Try measuring how long it takes to write a `new byte[1024 * 1024 * 64]`. Measure CPU time. – usr Sep 23 '12 at 11:03
1

You should profile which version is faster. The FileStream class has an internal buffer that decouples the Read() and Write() methods a bit from the actual file system accesses.

If you don't specify a buffer size in the FileStream constructor, it uses something like 4096 bytes of buffer by default. That buffer will combine many of your WriteByte() calls into one write to the underlying file. The only question is whether the overhead of the WriteByte() calls will exceed the overhead of the Enumerable.ToArray() call. The latter will definitely use more memory, but you always have to deal with this sort of trade-off.
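
If you want to lean on that internal buffer, you can also choose its size yourself; a quick sketch under the same assumptions as the question (the 64 KB value is just an example):

using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write,
                                   FileShare.None, 64 * 1024))
{
    // Each WriteByte call only touches the in-memory buffer; the file
    // system sees one write per filled 64 KB block.
    foreach (var b in bytes)
        stream.WriteByte(b);
}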

FYI: the current .NET 4 implementation of Enumerable.ToArray() grows an array by doubling its size whenever necessary. Each time it grows, all values are copied over. Also, once all items are stored in the array, its content is copied again to an array of the final size. For IEnumerable<T> instances that actually implement ICollection<T>, the code takes advantage of that fact to start with the correct array size and let the collection do the copying instead.
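
Roughly, that growth strategy behaves like the following sketch (an illustration of the doubling, not the actual BCL source):

using System;
using System.Collections.Generic;

static byte[] ToArraySketch(IEnumerable<byte> source)
{
    var buffer = new byte[4];
    var count = 0;

    foreach (var item in source)
    {
        if (count == buffer.Length)
            Array.Resize(ref buffer, buffer.Length * 2); // copies every element
        buffer[count++] = item;
    }

    Array.Resize(ref buffer, count); // one more copy to trim to the final size
    return buffer;
}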

Wormbo