I am playing around with the Marshal class in C# and am slightly confused by the result of this operation:

string someVal = "Hello There";
IntPtr ptS = Marshal.StringToHGlobalAnsi(someVal);
char* ptsPt = (char*)ptS.ToPointer();

When I look at ptsPt[0] in the Immediate window, it contains this value: '效'

I am guessing it has something to do with the StringToHGlobalAnsi method treating the managed chars as 8-bit values when they are really 16-bit. But I cannot quite understand why this is happening.

I know I can get around this issue by changing it to StringToHGlobalUni. But I don't understand why this is!

Cheers

William

1 Answer

It's because in C#, char is a 16-bit type. StringToHGlobalAnsi converts the string to ANSI, which is 1 byte per character for text like this. When you then look at ptsPt[0], it reads 16 bits, so it covers the first two ANSI characters at once.

Here's what the original string looks like in memory:

48 00 65 00 6C 00 6C 00 6F 00 20 00 ...

This is because C# strings are stored in UTF-16, and the above is UTF-16 for "Hello There".

After the call to StringToHGlobalAnsi, a new piece of memory is allocated, containing these bytes:

48 65 6C 6C 6F 20 ...

(and incidentally, this means you should free it with Marshal.FreeHGlobal when you're done).
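The two byte layouts above can be checked from managed code alone. This is only a rough sketch: it assumes Encoding.ASCII is a fair stand-in for the ANSI code page (true for plain ASCII text such as this string), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class ByteLayoutDemo
{
    // UTF-16 little-endian bytes, two per char for this text.
    public static string Utf16Hex(string s) =>
        BitConverter.ToString(Encoding.Unicode.GetBytes(s));

    // One byte per character; ASCII stands in for the ANSI code page here (assumption).
    public static string AnsiHex(string s) =>
        BitConverter.ToString(Encoding.ASCII.GetBytes(s));

    static void Main()
    {
        Console.WriteLine(Utf16Hex("Hello")); // 48-00-65-00-6C-00-6C-00-6F-00
        Console.WriteLine(AnsiHex("Hello"));  // 48-65-6C-6C-6F
    }
}
```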

Then, when you get a char* to this, the first char pointed to comprises the bytes 48 65, which due to little endianness really means 0x6548, which stands for the character 效.
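To see the same effect without unsafe code, something like the following should work. It is a sketch rather than a definitive version: CombineLittleEndian is a hypothetical helper that mimics what the char* read does on a little-endian machine, and the buffer is read one byte at a time with Marshal.ReadByte.

```csharp
using System;
using System.Runtime.InteropServices;

static class AnsiPointerDemo
{
    // Hypothetical helper: fuses two consecutive bytes the way a
    // little-endian CPU reads a single 16-bit char.
    public static char CombineLittleEndian(byte low, byte high) =>
        (char)(low | (high << 8));

    static void Main()
    {
        IntPtr ptS = Marshal.StringToHGlobalAnsi("Hello There");
        try
        {
            byte first = Marshal.ReadByte(ptS, 0);  // 0x48, 'H'
            byte second = Marshal.ReadByte(ptS, 1); // 0x65, 'e'

            // What the char* view sees: both bytes as one UTF-16 code unit.
            char fused = CombineLittleEndian(first, second);
            Console.WriteLine($"U+{(int)fused:X4} = {(int)fused}"); // U+6548 = 25928

            // Reading a single byte gives the expected character.
            Console.WriteLine((char)first); // H

            // Or convert the whole buffer back to a managed string.
            Console.WriteLine(Marshal.PtrToStringAnsi(ptS)); // Hello There
        }
        finally
        {
            // Memory from StringToHGlobalAnsi is unmanaged and must be released.
            Marshal.FreeHGlobal(ptS);
        }
    }
}
```

If you do want a raw pointer into the ANSI buffer, casting to byte* instead of char* keeps each read at one character.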

Roman Starkov
  • Great answer, cheers! I was getting 25928, 效, which I guess is the decimal representation of the hex value 0x6548. Just a quick question about how the string looks in memory though: why do we need that additional byte on C# managed chars? 48 00 is 'H'. Is it so we can store additional characters, like the Asian symbol we saw earlier (sorry, I don't know what language it's from!) – William Jul 12 '15 at 21:42
  • @William as a first approximation, yes, the reason is to expand the number of characters that can be represented by a single `char` value. Ultimately though it's a leftover from the times when it was thought that 65536 different characters would be enough for everything (Unicode v1.*). – Roman Starkov Jul 13 '15 at 10:20