Decode UTF-16 LE with BOM from ByteArray in C#

Question

I have to extract ID3v2 tags from an MP3 and have read the specification. My C# code works fine in general, but I have a small problem with the text encoding of the tags.

For example, let's take the COMM tag. After the 10-byte header, the actual text starts with a text encoding byte. Its value is 0x01, so the text that follows is UTF-16, here little-endian with a BOM (bytes 0xFF 0xFE). Then the text follows, with each character stored as two bytes. My code determines that the encoding is UTF-16 and uses

Encoding.Unicode.GetString(_buffer, startpos, length);

The result is the correct string, but it starts with the character U+FEFF, which is the decoded BOM. I wonder why it remains in the decoded string, so that I have to remove it afterwards.

I already checked the MSDN article for the UnicodeEncoding class and tested the sample code, in which a string with some Unicode characters is encoded and written to a text file together with a preamble. The text file is then opened again and the string is decoded; the result is just the text, without the BOM character. I suspect the difference is that my code reads the tags from a byte array, while the MS sample code reads them from a text file. However, the bytes look the same in both cases when I inspect them with a hex editor.

My question: Why does the BOM character remain in the decoded string when I decode it from a byte array?
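
The behaviour described above can be reproduced without a file. The following is only a minimal sketch, assuming the MS sample reads its file through a StreamReader; the class and method names (`BomDemo`, `DecodeRaw`, `DecodeViaReader`) are made up for illustration. `Encoding.GetString` decodes every byte pair it is given, so the BOM bytes come out as the character U+FEFF, whereas `StreamReader` recognizes the preamble at the start of the stream and skips it.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Decodes the whole buffer as UTF-16 LE; a leading BOM is decoded like any other character.
    public static string DecodeRaw(byte[] buffer)
    {
        return Encoding.Unicode.GetString(buffer, 0, buffer.Length);
    }

    // Reads the same bytes through a StreamReader, which skips a matching preamble.
    public static string DecodeViaReader(byte[] buffer)
    {
        using (var reader = new StreamReader(new MemoryStream(buffer), Encoding.Unicode))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // "Hi" in UTF-16 LE, preceded by the BOM bytes FF FE.
        byte[] buffer = { 0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00 };

        string raw = DecodeRaw(buffer);
        Console.WriteLine(raw.Length);          // 3
        Console.WriteLine(raw[0] == '\uFEFF');  // True

        string viaReader = DecodeViaReader(buffer);
        Console.WriteLine(viaReader.Length);    // 2
    }
}
```

So the bytes are identical in both cases; only the reading API differs in whether it treats the first two bytes as a preamble.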

Because it was put in the data before the data was Encoded. BOM is probably indicating it is the beginning of the data. — jdweng, Feb 21 '22 at 11:26
No, `0x01` only indicates "Unicode" (implying UTF-16, but strictly also UCS-2) but not any byte order - that's the reason why a BOM should be used/expected for the actual byte order of the text. Respect it, then just cut off the first 2 bytes if they're a BOM. — AmigoJack, Apr 03 '22 at 09:36
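
The advice in the last comment (respect the BOM, then cut off the first two bytes) could be sketched roughly as follows. `Id3Text.DecodeUtf16WithBom` is a hypothetical helper, not part of .NET or of any ID3 library, and falling back to little-endian when no BOM is present is an assumption made here, not something the specification guarantees. Terminating null characters are not handled.

```csharp
using System;
using System.Text;

static class Id3Text
{
    // Hypothetical helper: picks the byte order from the BOM and decodes the rest.
    public static string DecodeUtf16WithBom(byte[] buffer, int start, int count)
    {
        if (count >= 2)
        {
            // FF FE: little-endian, skip the two BOM bytes.
            if (buffer[start] == 0xFF && buffer[start + 1] == 0xFE)
                return Encoding.Unicode.GetString(buffer, start + 2, count - 2);

            // FE FF: big-endian, skip the two BOM bytes.
            if (buffer[start] == 0xFE && buffer[start + 1] == 0xFF)
                return Encoding.BigEndianUnicode.GetString(buffer, start + 2, count - 2);
        }

        // No BOM found: assume little-endian (an assumption of this sketch).
        return Encoding.Unicode.GetString(buffer, start, count);
    }

    static void Main()
    {
        // "Hi" with a little-endian BOM, as it might appear after the encoding byte 0x01.
        byte[] text = { 0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00 };
        Console.WriteLine(DecodeUtf16WithBom(text, 0, text.Length)); // Hi
    }
}
```

Checking the bytes before decoding, rather than trimming U+FEFF from the decoded string, also covers big-endian text, which `Encoding.Unicode` alone would decode incorrectly.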
