
I must write strings to a binary MIDI file. The standard requires one to know the length of the string in bytes. As I want to write for mobile as well, I cannot use AnsiString, which was a good way to ensure that the string was a one-byte string. That simplified things. I tested the following code:

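type
   TByte = array of Byte;

function TForm3.convertSB (arg: string): TByte;
var
   i: Int32;
begin
   Result := nil; // this test only displays values; nothing is converted yet
   Label1.Text := IntToStr (SizeOf (Char)); // size of one Char in bytes
   for i := Low (arg) to High (arg) do
   begin
      // append the ordinal value of each character
      Label1.Text := Label1.Text + ' ' + IntToStr (Ord (arg [i]));
   end;
end; // convert SB //

convertSB ('MThd');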

It returns 2 77 84 104 100 (as label text) in Windows as well as Android. Does this mean that Delphi treats strings by default as UTF-8? This would greatly simplify things but I couldn't find it in the help. And what is the best way to convert this to an array of bytes? Read each character and test whether it is 1, 2 or 4 bytes and allocate this space in the array? For converting back to a character: just read the array of bytes until a byte is encountered < 128?

Arnold
  • @Tlama - That was on purpose :-) When writing a MIDI file I can simply arrange for that to be the case. I wondered what Delphi was doing when using just UTF-8 characters: would it automatically change that to a representation of two bytes? It does not and -interestingly- it does so consistently for Windows and Android. – Arnold Jan 29 '14 at 21:24
  • Actually, the characters in your code in the question are indeed two bytes wide. That's what the call to SizeOf told you. – David Heffernan Jan 29 '14 at 22:42

1 Answer


Delphi strings are encoded internally as UTF-16. There was a big clue in the fact that SizeOf(Char) is 2.

The reason that all your characters had ordinals in the ASCII range is that UTF-16 extends ASCII in the sense that characters 0 to 127, the ASCII range, have the same ordinal value in UTF-16. And all your characters are ASCII characters.
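As a rough illustration of the point above, here is a minimal console sketch (the program name Utf16Demo is made up for this example, and a desktop console target is assumed). It shows that each Char occupies two bytes in memory even though the ordinal values of ASCII characters stay below 128:

```pascal
program Utf16Demo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  S: string;
  C: Char;
begin
  S := 'MThd';
  // One Char is one UTF-16 code unit, so this reports 2
  Writeln('SizeOf(Char) = ', SizeOf(Char));
  // Four characters, but eight bytes of character data in memory
  Writeln('Length = ', Length(S),
          ', bytes in memory = ', Length(S) * SizeOf(Char));
  // ASCII characters keep their ASCII ordinals: 77 84 104 100
  for C in S do
    Write(Ord(C), ' ');
  Writeln;
end.
```

So the ordinals alone cannot tell you how the string is stored; SizeOf(Char) is what reveals the two-byte code units.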

That said, you do not need to worry about the internal storage. You simply convert between string and byte array using the TEncoding class. For instance, to convert to UTF-8 you write:

bytes := TEncoding.UTF8.GetBytes(str);

And in the opposite direction:

str := TEncoding.UTF8.GetString(bytes);
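Put together, a round trip might look like the sketch below. The program name and the sample text are assumptions made for illustration; the point is that Length of the resulting byte array, not Length of the string, is the byte count a binary file format would need:

```pascal
program Utf8RoundTrip;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  Text, Back: string;
  Bytes: TBytes;
begin
  // 'caf' followed by U+00E9 (e with acute accent), a non-ASCII character
  Text := 'caf' + #$00E9;
  Bytes := TEncoding.UTF8.GetBytes(Text);
  Writeln('Characters:  ', Length(Text));   // 4 UTF-16 code units
  Writeln('UTF-8 bytes: ', Length(Bytes));  // 5, since U+00E9 takes two bytes
  // Decode the bytes again and compare with the original string
  Back := TEncoding.UTF8.GetString(Bytes);
  Writeln('Round trip ok: ', Back = Text);
end.
```

There is no need to inspect individual bytes to find character boundaries; TEncoding handles the multi-byte sequences in both directions.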

The class supports many other encodings, as described in the documentation. It's not clear from the question which encoding you need to use. Hopefully you can work the rest out from here.
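As one example of another encoding, a text that is known to be pure ASCII, such as the 'MThd' string from the question, could be converted with TEncoding.ASCII instead. This is only a sketch; the program name is made up, and whether ASCII or UTF-8 is the right choice depends on what the file format expects:

```pascal
program AsciiBytes;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  Bytes: TBytes;
  B: Byte;
begin
  // Each ASCII character becomes exactly one byte
  Bytes := TEncoding.ASCII.GetBytes('MThd');
  Writeln('Byte count: ', Length(Bytes));  // 4
  for B in Bytes do
    Write(B, ' ');                         // 77 84 104 100
  Writeln;
end.
```

For ASCII-only input, TEncoding.UTF8 produces the same bytes, so UTF-8 is a safe default when non-ASCII text may also appear.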

David Heffernan
  • I want to read and write the file solely in UTF-8, which you already guessed. TEncoding is a great class that solves a lot of my string format questions. My question was inspired by the fear that I had to do all the encoding myself, hence the question about the internal representation. – Arnold Jan 29 '14 at 21:42