Reading a "string in little-endian UTF-16 encoding" with BinaryReader

Question

I am following this specification of this file format: https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec

utf16: string in little-endian UTF-16 encoding

How do I read this? I tried BinaryReader.ReadString(), but it returns something along the lines of:

"\0e\0y\0w\0o\0r\0d\0\0 \0\0\0\0\rP\0a\0r\0a\0m\0e\0t\0e\0r\0N\0a\0m\0e\0\0 \0\0\0\0\fC\0o\0n\0f\0i\0g\0S\0t\0r\0"

That definitely isn't right.

From the specification:

ubyte: number of UTF-16 characters (not bytes) of the name of the field
utf16: name of the field
ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
utf16: alias of the field (ommitted if previous field is 0)
ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )

Could I somehow use the number of UTF-16 characters to read the name of the field?

How do you construct the `BinaryReader`? Are you using an overload where you specify the encoding of the text? — Damien_The_Unbeliever, Aug 01 '14 at 14:20
Normally you specify the encoding, but on [this](http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx) page there is no `little endian utf-16`. Perhaps you have to make your own encoding somehow (or one of them **is** what you need, not sure). — Sinatr, Aug 01 '14 at 14:23
BinaryReader br = new BinaryReader(File.Open("C:\\florida.gdb\\a00000002.gdbtable", FileMode.Open, FileAccess.Read, FileShare.Read | FileShare.Delete)); — Evan Parsons, Aug 01 '14 at 14:25
@Sinatr - there is such an encoding. It helps to know that in the Windows world, `Unicode` means UTF-16. — Damien_The_Unbeliever, Aug 01 '14 at 14:28

ulrichb · Accepted Answer · 2014-08-01T14:46:47.877

2

BinaryReader's ReadString() method doesn't provide an overload where you can specify the string length. Instead, it expects the string to be prefixed with its length in bytes, written as a 7-bit encoded integer, which doesn't match the format of the spec you linked.
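
To make the mismatch concrete, here is a minimal sketch that writes a string with BinaryWriter (the counterpart of ReadString()) and dumps the resulting bytes. The class and method names are made up for illustration; only BinaryWriter, MemoryStream, Encoding.Unicode and BitConverter are real framework types.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadStringPrefixDemo
{
    // Hypothetical helper: returns the raw bytes BinaryWriter emits for a
    // string, so the length prefix that ReadString() expects is visible.
    public static byte[] Serialize(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.Unicode))
            {
                writer.Write(text);
            }
            // ToArray() still works after the stream has been closed.
            return stream.ToArray();
        }
    }

    static void Main()
    {
        // "Hi" is 2 characters but 4 bytes in UTF-16, and the prefix
        // stores the byte count: prints 04-48-00-69-00.
        Console.WriteLine(BitConverter.ToString(Serialize("Hi")));
    }
}
```

The spec's ubyte prefix, by contrast, counts UTF-16 characters, so ReadString() would read the wrong number of bytes even when the encoding is set correctly.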

Therefore, you cannot use ReadString() directly, but you can

use ReadByte() to get the string (character) length,
multiply it by 2,
use ReadBytes(count),
use Encoding.Unicode.GetString(bytes).
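
The steps above can be sketched roughly as follows. ReadCountedUtf16 is a hypothetical helper name, and the byte stream in Main is synthetic test data built in memory (a name, an empty alias and a field type, laid out as in the spec excerpt from the question), not content taken from a real .gdbtable file.

```csharp
using System;
using System.IO;
using System.Text;

static class FgdbStrings
{
    // Hypothetical helper: reads a ubyte character count followed by
    // that many UTF-16LE code units (2 bytes each).
    public static string ReadCountedUtf16(BinaryReader reader)
    {
        int charCount = reader.ReadByte();
        int byteCount = charCount * 2;
        byte[] bytes = reader.ReadBytes(byteCount);
        if (bytes.Length != byteCount)
            throw new EndOfStreamException("String data is truncated.");
        // A count of 0 yields an empty array and therefore an empty string.
        return Encoding.Unicode.GetString(bytes);
    }

    static void Main()
    {
        // Synthetic record: name "Keyword", empty alias, field type 4 (string).
        var stream = new MemoryStream();
        byte[] name = Encoding.Unicode.GetBytes("Keyword");
        stream.WriteByte(7);                 // number of UTF-16 characters
        stream.Write(name, 0, name.Length);  // 14 bytes of name data
        stream.WriteByte(0);                 // alias length 0, alias omitted
        stream.WriteByte(4);                 // field type
        stream.Position = 0;

        using (var reader = new BinaryReader(stream))
        {
            string fieldName = ReadCountedUtf16(reader);
            string alias = ReadCountedUtf16(reader);
            byte fieldType = reader.ReadByte();
            Console.WriteLine("{0} / \"{1}\" / {2}", fieldName, alias, fieldType);
        }
    }
}
```

Because the helper consumes exactly one count byte plus the string bytes, the reader is left positioned at the next field, which matters when the strings are embedded in a larger record.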

edited Aug 01 '14 at 14:46

answered Aug 01 '14 at 14:37

ulrichb


Is multiplying by two necessary? When I do it, it returns something similar to the answer below, except with more Chinese/Japanese characters after it. Code sample: `int count = br.ReadByte() * 2; byte[] array = br.ReadBytes(count); field.nameOfField = Encoding.Unicode.GetString(array);` – Evan Parsons Aug 01 '14 at 16:06
The spec says number of characters, not bytes. Since Encoding.Unicode uses 16-bit code units (2 bytes per char), you want to multiply by 2. You might want to add code to your question showing how you try to read the string. – CSharpie Aug 01 '14 at 16:09
aha! I think that's it! It returns "Keyword" which I believe is the name of the field. – Evan Parsons Aug 01 '14 at 16:17

score 1 · Answer 2 · answered Aug 01 '14 at 14:29

It should be:

BinaryReader br = new BinaryReader(File.Open("C:\\florida.gdb\\a00000002.gdbtable",
                                   FileMode.Open,
                                   FileAccess.Read,
                                   FileShare.Read | FileShare.Delete),
                      Encoding.Unicode);

Where Encoding is System.Text.Encoding.

For various historical reasons, Microsoft/Windows refer to UTF-16 (and, specifically, the little-endian variant) as "Unicode" rather than UTF-16.
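
As a small illustration of the byte order involved, the sketch below compares Encoding.Unicode with Encoding.BigEndianUnicode and then decodes a byte pair that is one byte out of step. The class and method names are invented for this example. The misaligned case seems consistent with the CJK-looking characters reported in the comments, although that is an interpretation rather than something the thread confirms.

```csharp
using System;
using System.Text;

static class EndiannessDemo
{
    // Hypothetical helper: decodes bytes as little-endian UTF-16.
    public static string DecodeLittleEndian(byte[] bytes)
    {
        return Encoding.Unicode.GetString(bytes);
    }

    static void Main()
    {
        // Little-endian puts the low byte first: 65-00 for 'e'.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("e")));
        // Big-endian puts the high byte first: 00-65.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("e")));

        // If decoding starts one byte early or late, the pair 00-65 is read
        // as the single code unit 0x6500 instead of 'e' (0x0065).
        string shifted = DecodeLittleEndian(new byte[] { 0x00, 0x65 });
        Console.WriteLine(((int)shifted[0]).ToString("X4")); // prints 6500
    }
}
```

So choosing the right encoding is necessary but not sufficient: decoding also has to start on the correct byte boundary.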

answered Aug 01 '14 at 14:29

Damien_The_Unbeliever


It returns "攀礀眀漀爀搀\0 \0ЀഀParameterNameЀ \0䌌漀渀昀椀最匀琀爀" when I switch it to your coding. Would I have to strip out the other characters? I'd do that, but I'm afraid of losing them when I go to save it again. – Evan Parsons Aug 01 '14 at 14:39
If you get that in return something is almost certainly wrong. – Lasse V. Karlsen Aug 01 '14 at 14:59
The file format doesn't work like this! You have to read the bytes at the specific offset and then interpret them as Unicode. – CSharpie Aug 01 '14 at 16:05
