I have to extract ID3v2 tags from an MP3 file and have read the specification. My C# code generally works fine, but I have a small problem with the text encoding of the tags.
For example, take the COMM tag. After the 10-byte header, the actual text starts with a text encoding byte. Its value is 0x01, so the text that follows is UTF-16 with a BOM, here the bytes 0xFF 0xFE (little-endian). Then the text follows, two bytes per character. My code determines that the encoding is UTF-16 and uses
Encoding.Unicode.GetString(_buffer, startPos, byteCount);
The result is the correct string, but it begins with the character U+FEFF, which is the decoded BOM. I wonder why it remains in the decoded string, so that I have to remove it afterwards.
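To make the behaviour reproducible, here is a minimal sketch. The `BomRepro` class, its `Sample` bytes, and both method names are hypothetical stand-ins for my real buffer and code, not part of any library. It assumes the text portion of the frame starts with the bytes FF FE and shows the manual workaround of skipping them before decoding.

```csharp
using System;
using System.Text;

public static class BomRepro
{
    // Hypothetical stand-in for the text part of a COMM frame:
    // BOM bytes FF FE followed by "Hi" encoded as UTF-16 LE.
    public static readonly byte[] Sample = { 0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00 };

    // Decodes the given range as-is; the BOM bytes are decoded
    // like any other character and end up in the string.
    public static string DecodeRaw(byte[] buffer, int start, int count)
    {
        return Encoding.Unicode.GetString(buffer, start, count);
    }

    // Workaround: compare the first bytes with the encoding's preamble
    // (FF FE for Encoding.Unicode) and skip them if they match.
    public static string DecodeSkippingBom(byte[] buffer, int start, int count)
    {
        byte[] bom = Encoding.Unicode.GetPreamble();
        bool hasBom = count >= bom.Length
                      && buffer[start] == bom[0]
                      && buffer[start + 1] == bom[1];
        int offset = hasBom ? bom.Length : 0;
        return Encoding.Unicode.GetString(buffer, start + offset, count - offset);
    }
}
```

With these bytes, `DecodeRaw(Sample, 0, 6)` returns a three-character string whose first character is U+FEFF, while `DecodeSkippingBom(Sample, 0, 6)` returns just "Hi".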
I already checked the MSDN article for the UnicodeEncoding class and tested the sample code, in which a string with some Unicode characters is encoded and written, together with a preamble, to a text file. The text file is then opened again and the string is decoded, and the result is just the text, without the BOM character.
I suspect the difference is that my code gets the tags from a byte array, while the MS sample code reads them from a text file.
However, the data looks the same when I open it in a hex editor.
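The two paths can be compared on identical bytes with a sketch like the following. It assumes that the sample reads the file back through a StreamReader (I am reconstructing that from memory, so treat it as an assumption); the `StreamVsArray` class and its method names are hypothetical, and a MemoryStream stands in for the file.

```csharp
using System.IO;
using System.Text;

public static class StreamVsArray
{
    // Reads the bytes through a StreamReader, as when reading a text file.
    // The reader recognises the leading preamble and does not return it.
    public static string ReadViaStreamReader(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.Unicode, true))
        {
            return reader.ReadToEnd();
        }
    }

    // Decodes the same bytes directly; every byte pair, including
    // the BOM, is turned into a character.
    public static string ReadViaGetString(byte[] bytes)
    {
        return Encoding.Unicode.GetString(bytes);
    }
}
```

Fed the same six bytes (FF FE 48 00 69 00), the first method returns "Hi" and the second returns U+FEFF followed by "Hi", which matches what I observe.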
My question:
Why does the BOM character remain in the decoded string when I decode it from a byte array?