5

When writing a string to a binary file using C#, the length (in bytes) is automatically prepended to the output. According to the MSDN documentation this is an unsigned integer, but it is also a single byte. The example they give is that a single UTF-8 character would be written as three bytes: 1 size byte and 2 bytes for the character. This is fine for strings up to length 255, and matches the behaviour I've observed.

However, if your string is longer than 255 bytes, the unsigned integer grows to as many bytes as necessary. As a simple example, writing a 1024-character string:

string header = "ABCDEFGHIJKLMNOP";  // 16 characters
for (int ii = 0; ii < 63; ii++)
{
  header += "ABCDEFGHIJKLMNOP";      // 64 * 16 = 1024 characters in total
}
fileObject.Write(header);            // fileObject is a BinaryWriter

results in 2 bytes prepending the string. Creating a string of length 2^17 results in a somewhat maddening 3-byte prefix.
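A quick way to see the prefix (a sketch, not from the original code; it uses its own BinaryWriter over a MemoryStream and assumes the System, System.IO and System.Text namespaces):

string header = new string('A', 1024);                 // 1024 ASCII characters

using (var ms = new MemoryStream())
{
    using (var bw = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
    {
        bw.Write(header);                              // length prefix + payload
    }

    byte[] bytes = ms.ToArray();
    Console.WriteLine(bytes.Length);                   // 1026: 2 prefix bytes + 1024 payload bytes
    Console.WriteLine($"{bytes[0]:X2} {bytes[1]:X2}"); // 80 08: the two prefix bytes
}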

The question, therefore, is: when reading, how do I know how many bytes to read to get the size of what follows? I wouldn't necessarily know the header size a priori. Ultimately, can I force the Write(string) method to always use a consistent prefix size (say, 2 bytes)?

A possible workaround is to write my own Write(string) method, but I would like to avoid that for obvious reasons (similar questions here and here accept this as an answer). Another, more palatable workaround is to have the reader look for a specific character that marks the start of the ASCII string information (maybe an unprintable character?), but that is not infallible. A final workaround (that I can think of) would be to force the string to stay within the range of sizes for a particular number of size bytes; again, that is not ideal.

While forcing the size of the length prefix to be consistent would be the easiest fix, I have control over the reader, so any clever reader-side solutions are also welcome.

Andy K.
  • It uses a [variable-length 7-bit encoding](https://stackoverflow.com/a/31501941/17034). A micro-optimization, very little reason to be mad about it. If you don't like it then consider Encoding.UTF8.GetBytes() but don't forget to also serialize the length of the byte[] array so you can properly read it back. Don't use 7-bit encoding, hehe. – Hans Passant Nov 21 '17 at 09:27
  • So you are reading that binary file not with `BinaryReader`? – Evk Nov 21 '17 at 09:28
  • There is no guarantee that the file will forever be read with binary reader. – Andy K. Nov 21 '17 at 09:30
  • To echo @Evk's point: it is only *intended* that `BinaryWriter` is easy to consume via `BinaryReader`; neither is intended for *general purpose* binary IO. So if you're trying to write a specific protocol: *don't use `BinaryWriter`* - unless it happens to be a 100% match – Marc Gravell Nov 21 '17 at 09:30
  • Are you SURE strings of length between 128 and 255 are actually storing the length as a single byte? – Matthew Watson Nov 21 '17 at 09:30
  • @MatthewWatson I'm sure that they aren't :) – Marc Gravell Nov 21 '17 at 09:30
  • @MatthewWatson No, I didn't check between 128 and 255, good point. I checked something small and something around 1024. UTF-8 vs ASCII error on my part. – Andy K. Nov 21 '17 at 09:31
  • @HansPassant You're right of course. The root of my question from a philosophical standpoint is: what's the point of a variable-length byte array? If I don't know the length of the string, I probably don't know the length of the byte array describing it, so it seems that it's useless. How do I know when to stop reading bytes to get the size of the string? Am I thinking about this incorrectly? – Andy K. Nov 21 '17 at 09:33
  • @AndyK. in that encoding each byte carries information about whether another byte follows (that's why it's 7-bit encoding - the last bit is used for that). So you read 1 byte, then check that bit and decide whether you need to read the next byte or not. That means you can always read the string length, even though that length is encoded in a variable-length array. – Evk Nov 21 '17 at 09:36
  • You can only read it back correctly when you either know the length or have a special "end-of-string" character. Since the framework always knows the length of a string, and abhors abusing special character codes like the C language does, that is what it uses. It is variable length since storing the length can take between 1 and 4 bytes. Usually 1. You'd know the length of the byte[] array as well, it is Length. – Hans Passant Nov 21 '17 at 09:37
  • "what's the point of a variable-length byte array?" - efficiency; a lot of data is composed of multiple **short** strings (human names, company names, product names / codes, address lines, status strings, guid strings etc); taking a single byte to encode these saves a surprising amount of space in a large file. When you have a long string, the extra bytes that variable-length encoding takes is **irrelevant** compared to the size of the string. As for how to stop reading bytes: you need to reverse the same operation that the encoder used - in this case, by reading single bytes until MSB is 0. – Marc Gravell Nov 21 '17 at 09:38
  • @Evk Ah, that's where my error is. I didn't appreciate the use of 7-bit encoding. With that, this makes sense. – Andy K. Nov 21 '17 at 09:40
  • @AndyK. here's the reference source for `Read7BitEncodedInt`: https://referencesource.microsoft.com/#mscorlib/system/io/binaryreader.cs,582 – Marc Gravell Nov 21 '17 at 09:42

2 Answers

3

BinaryWriter and BinaryReader aren't the only way of writing binary data; simply: they provide a convention that is shared between that specific reader and writer. No, you can't tell them to use another convention - unless of course you subclass both of them and override the ReadString and Write(string) methods completely.

If you want to use a different convention, then simply: don't use BinaryReader and BinaryWriter. It is pretty easy to talk to a Stream directly using any text Encoding you want to get hold of the bytes and the byte count. Then you can use whatever convention you want. If you only ever need to write strings up to 65k then sure: use fixed 2 bytes (unsigned short). You'll also need to decide which byte comes first, of course (the "endianness").
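For illustration, here is a minimal sketch of such a convention - the helper names, UTF-8, and the little-endian byte order are arbitrary choices for the example, not anything the answer prescribes:

static void WriteLengthPrefixed(Stream stream, string value)
{
    byte[] payload = Encoding.UTF8.GetBytes(value);
    if (payload.Length > ushort.MaxValue)
        throw new ArgumentOutOfRangeException(nameof(value), "Too long for a 2-byte prefix.");
    stream.WriteByte((byte)(payload.Length & 0xFF));        // low byte first (little-endian)
    stream.WriteByte((byte)((payload.Length >> 8) & 0xFF)); // high byte
    stream.Write(payload, 0, payload.Length);
}

static string ReadLengthPrefixed(Stream stream)
{
    int length = stream.ReadByte() | (stream.ReadByte() << 8); // always exactly 2 prefix bytes
    if (length < 0) throw new EndOfStreamException();
    byte[] payload = new byte[length];
    int offset = 0;
    while (offset < length)                                    // Stream.Read may return fewer bytes than requested
    {
        int read = stream.Read(payload, offset, length - offset);
        if (read == 0) throw new EndOfStreamException();
        offset += read;
    }
    return Encoding.UTF8.GetString(payload);
}

Because the reader always consumes exactly two prefix bytes, there is no ambiguity about where the string data starts.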

As for the size of the prefix: it is essentially using:

int byteCount = this._encoding.GetByteCount(value);
this.Write7BitEncodedInt(byteCount);

with:

protected void Write7BitEncodedInt(int value)
{
    uint num = (uint) value;
    while (num >= 0x80)
    {
        this.Write((byte) (num | 0x80)); // emit the low 7 bits with the continuation bit (0x80) set
        num = num >> 7;
    }
    this.Write((byte) num);              // final byte: continuation bit clear
}

This type of encoding of lengths is pretty common - it is the same idea as the "varint" that "protobuf" uses, for example (base-128, least significant group first, retaining bit order in 7-bit groups, 8th bit as continuation).
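The decoder simply reverses this - as Evk and Marc Gravell note in the comments, you read one byte at a time and stop as soon as the continuation bit is clear. A sketch of that loop (mirroring the Read7BitEncodedInt reference source linked above; this standalone helper is illustrative, not the framework API):

static int Read7BitEncodedLength(Stream stream)
{
    int result = 0;
    int shift = 0;
    while (true)
    {
        int b = stream.ReadByte();
        if (b < 0) throw new EndOfStreamException();
        result |= (b & 0x7F) << shift;      // low 7 bits carry the value, least significant group first
        if ((b & 0x80) == 0) return result; // continuation bit clear: this was the last byte
        shift += 7;
        if (shift > 28) throw new FormatException("Bad 7-bit encoded Int32.");
    }
}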

Marc Gravell
  • Referring to them as a convention makes a lot of sense, and explains why you cannot change them to fit your needs without overriding the methods. Your comment above strengthens that point, and is a paradigm shift in how I think about them. – Andy K. Nov 21 '17 at 09:30
  • @AndyK. to be honest, it sounds like you should be dealing with `Stream` directly... – Marc Gravell Nov 21 '17 at 09:31
  • I am writing human-readable header information to a data file, and it was just so tempting to use a very simple write(string) method which, on the surface, did everything I wanted. I think you're right. – Andy K. Nov 21 '17 at 09:34
2

If you want to write the length yourself:

using (var bw = new BinaryWriter(fs)) // fs: your output Stream
{
  byte[] bytes = Encoding.Unicode.GetBytes("Your string");
  bw.Write((ushort)bytes.Length); // the length prefix: use a byte, a ushort, an int...
  bw.Write(bytes);
}
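A possible reading counterpart (a sketch, not part of the answer), assuming the length was written as a ushort and the text was encoded with Encoding.Unicode as above:

using (var br = new BinaryReader(fs))
{
  ushort length = br.ReadUInt16();                                // the fixed-size prefix
  string text = Encoding.Unicode.GetString(br.ReadBytes(length)); // then exactly that many bytes
}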
Maxence