13

I'm working on porting some Delphi 7 code to XE4, so Unicode is the subject here.

I have a method where a string gets written to a TMemoryStream. According to this Embarcadero article, I should multiply the length of the string (in characters) by the size of the Char type to get the length in bytes, which is what the count parameter of WriteBuffer needs.

So, before:

rawHtml : string; //AnsiString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml));

And after:

rawHtml : string; //UnicodeString
...
memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));

My understanding of Delphi's UnicodeString type is that it's UTF-16 internally. But my general understanding of Unicode is that not all Unicode characters can be represented even in 2 bytes, and that some corner-case foreign characters will take 4 bytes. Another of Embarcadero's articles seems to confirm my suspicions: "In fact, it isn’t even always true that one Char is equal to two bytes!"

So... that leaves me wondering whether Length(rawHtml) * SizeOf(Char) is really going to be robust enough to be consistently accurate, or whether there's a better way to determine the byte size of the string.

Jessica Brown
  • 8,222
  • 7
  • 46
  • 82
  • 5
    why don't you use `TStringStream` instead of `TMemoryStream`? – teran May 13 '13 at 19:55
  • Ultimately the MemoryStream is passed to a TWebBrowser component to display. Pretty much every example I've seen of that has used MemoryStream. Would StringStream be a better choice for that purpose? – Jessica Brown May 13 '13 at 19:59
  • @Jessica In the end, they're both based on a `TStream` which means the internal structure of both work the same - it's just how you interact with it that's different. So even a `TFileStream` or `TResourceStream` are applicable to use in your case, that is, if you were sending Files or Resources to your browser anyway. – Jerry Dodge May 13 '13 at 21:34
  • 2
    It still hurts that Delphi didn't just use UTF8 internally. – Roddy May 13 '13 at 21:56
TStringStream is a TMemoryStream descendant, so it makes sense to replace it – OnTheFly May 13 '13 at 22:04
  • @Roddy Delphi followed its platform, which chose its path before UTF8 was even invented. – David Heffernan May 13 '13 at 22:42
  • @Roddy: I'm not sure it hurts at all. With UTF-16 you know that almost all characters are two bytes (yeah, the vast majority of them take up twice as much space as they have to, but space is seldom an issue these days). With UTF-8 you don't know that, unless you are strictly confined to English letters and punctuation. – Andreas Rejbrand May 13 '13 at 22:55
  • @Andreas What use is knowing that most code points are a single character element? You still have to code for generality. UTF8 has lots of advantages. Had UTF8 been invented before NT and Java then I bet there would be no UTF16. – David Heffernan May 14 '13 at 06:35
@AndreasRejbrand : http://www.utf8everywhere.org/ @David: yes, MS could have supported CP_UTF8 in the 'fooA' API calls (in fact, they still could). I understand how we got to where we are, but it feels like a bit of a missed opportunity. – Roddy May 14 '13 at 08:33
  • @David: Yes, I guess you are right. Sorry. – Andreas Rejbrand May 14 '13 at 08:51

4 Answers

11

Delphi's UnicodeString is encoded with UTF-16. UTF-16 is a variable-length encoding, just like UTF-8. In other words, a single Unicode code point may require multiple character elements to encode it. As a point of interest, the only fixed-length Unicode encoding is UTF-32. The UTF-16 encoding uses 16-bit character elements, hence the name.

In a Unicode Delphi, Char is an alias for WideChar which is a UTF-16 character element. And string is an alias for UnicodeString, which is an array of WideChar elements. The Length() function returns the number of elements in the array.

So, SizeOf(Char) is always 2 for UnicodeString. Some Unicode code points are encoded with multiple character elements, or Chars. But Length() returns the number of character elements, not the number of code points. The character elements all have the same size. So

memorystream1.WriteBuffer(Pointer(rawHtml)^, Length(rawHtml) * SizeOf(Char));

is correct.
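
If you ever do need the number of code points rather than the number of Char elements (you don't for WriteBuffer), a rough sketch along these lines would work. It assumes the System.Character unit, and CodePointCount is just an illustrative name, not an RTL routine:

uses
  System.Character;

function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // A high surrogate followed by a low surrogate is one code point in two Chars
    if TCharacter.IsHighSurrogate(S[I]) and (I < Length(S)) and
       TCharacter.IsLowSurrogate(S[I + 1]) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;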

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490
8

My understanding of Delphi's UnicodeString type is that it's UTF-16 internally.

You are correct about the UTF-16 encoding of Delphi's UnicodeString. This means that one 16-bit Char is wide enough to represent any code point from the Basic Multilingual Plane as exactly one Char element of the string array.

But my general understanding of Unicode is that not all unicode characters can be represented even in 2 bytes, that some corner case foreign characters will take 4 bytes.

However, you've got a little misconception here. The Length function does not perform any deep inspection of characters; it simply returns the number of 16-bit WideChar elements, without taking any surrogates within your string into account. This means that if you assign a single character from any of the Supplementary Planes to a UnicodeString, Length will return 2.

program Egyptian;

{$APPTYPE CONSOLE}

var
  S: UnicodeString;

begin
  S := #$1304E;          // U+1304E: a single code point outside the BMP
  Writeln(Length(S));    // prints 2: the code point is stored as a surrogate pair
  Readln;
end.

Conclusion: the byte size of the string data is always fixed and equals Length(S) * SizeOf(Char), no matter whether S contains any surrogate pairs.
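
To connect this back to the stream question, here is a small sketch, reusing the same #$1304E example, that writes the string to a TMemoryStream and confirms the stream holds Length(S) * SizeOf(Char) bytes:

program EgyptianStream;

{$APPTYPE CONSOLE}

uses
  System.Classes;

var
  S: UnicodeString;
  Stream: TMemoryStream;

begin
  S := #$1304E;                    // one code point, two Char elements
  Stream := TMemoryStream.Create;
  try
    Stream.WriteBuffer(Pointer(S)^, Length(S) * SizeOf(Char));
    Writeln(Stream.Size);          // 4, i.e. Length(S) * SizeOf(Char)
  finally
    Stream.Free;
  end;
  Readln;
end.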

OnTheFly
  • 2,059
  • 5
  • 26
  • 61
6

Others have explained how UnicodeString is encoded and how to calculate its byte length. I just want to mention that the RTL already has such a function - SysUtils.ByteLength():

memorystream1.WriteBuffer(PChar(rawHtml)^, ByteLength(rawHtml));
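
For a UnicodeString the result is the same Length * SizeOf(Char) calculation the other answers use; roughly speaking (paraphrasing the effect, not quoting the RTL source, and ByteLengthEquivalent is only an illustrative name):

// Roughly what ByteLength evaluates to for a UnicodeString
function ByteLengthEquivalent(const S: string): Integer;
begin
  Result := Length(S) * SizeOf(Char);
end;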
Remy Lebeau
  • 555,201
  • 31
  • 458
  • 770
  • This is a really badly designed function mind you. It will accept strings other than UnicodeString but return rather useless values. Think through what happens when you pass it a UTF8String. I QC'ed this to no avail. – David Heffernan May 14 '13 at 04:09
I read your QC report. Your proposed "solution" is not any better, because passing a `UnicodeString` to a `RawByteString` still performs a data conversion, this time from UTF-16 to Ansi, which can be lossy. `RawByteString` does not preserve `UnicodeString` data, only `AnsiString(N)` data. The correct solution is to overload `ByteLength()` on **both** `UnicodeString` and `RawByteString`, like other RTL functions do. – Remy Lebeau May 14 '13 at 08:05
  • You are quite right. I will fix my QC report. It's probably a waste of time though because the report has never even been opened. – David Heffernan May 14 '13 at 08:08
  • After you fix it, I can push it up to the next level. – Remy Lebeau May 14 '13 at 08:45
3

What you are doing is correct (with the SizeOf(Char)).

What you refer to is that one character does not always correspond to one code point (due to surrogate pairs, for example). But the UCS-2 encoded (NOT UTF-16) characters in the string take up exactly Length(Str) * SizeOf(Char) bytes.

Note that the Unicode encoding used in Delphi is the same one all the Windows API calls expect in their ...W variants.
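
To illustrate that point, a Delphi string can be passed straight to a ...W API call via PChar (which is PWideChar in Unicode Delphi). A minimal sketch:

program WideApiDemo;

uses
  Winapi.Windows;

var
  S: string;

begin
  S := 'Hello';
  // PChar is PWideChar in Unicode Delphi, so the ...W call receives UTF-16 directly
  MessageBoxW(0, PChar(S), 'Demo', MB_OK);
end.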

Ritsaert Hornstra
  • 5,013
  • 1
  • 33
  • 51
  • What are you talking about? The question is about UTF16 and not about UCS2. – David Heffernan May 13 '13 at 21:05
  • In UnicodeString UTF-16 is used, not the older UCS-2. So a code point can be made up of either one or two Chars. But as David explained, a surrogate pair is two Chars, and Length counts the number of Char elements, not the number of code points. – Rudy Velthuis May 13 '13 at 21:17
  • Windows strings have been UTF-16 since Windows 2000 – afrazier May 13 '13 at 22:51