I have an AnsiString and need to convert it to a TBytes in the most efficient way. How can I do that?

Kromster
zeus

2 Answers

10

The function BytesOf converts an AnsiString to TBytes.

var
  A: AnsiString;
  B: TBytes;
begin
  A := 'Test';
  B := BytesOf(A);

  // convert it back
  SetString(A, PAnsiChar(B), Length(B));
end;
Sebastian Z
  • @DavidHeffernan No, but that wasn't the OP's question :-) – HeartWare Jan 19 '18 at 10:49
  • @HeartWare Read the question title again – David Heffernan Jan 19 '18 at 11:10
  • That was pretty much hidden in the title. I added a note about converting it back. – Sebastian Z Jan 19 '18 at 13:34
  • Indeed. I only added that bit as an edit after belatedly spotting it in the title. – David Heffernan Jan 19 '18 at 13:57
  • Note that `BytesOf()` crashes if the input string is empty, because it doesn't check for `Length=0` before indexing into both the bytes and the string. Using `SetLength()+Move()` with `Pointer` typecasts, as David showed, does not suffer from that issue. Also note that, going the other way, `SetString()` is simpler to use than an explicit `SetLength()+Move()`: `SetString(ansiStr, PAnsiChar(bytes), Length(bytes));` – Remy Lebeau Jan 19 '18 at 19:00
  • Yes, `SetString()` is better. I've updated the code. `BytesOf('')` doesn't crash for me. Am I just being lucky? – Sebastian Z Jan 19 '18 at 22:15
  • @remy The RTL is compiled without range checking. So BytesOf won't fail on an empty string. One of the nuances of the code in my answer is that it avoids calls to UniqueString. Both BytesOf and your SetString call won't. – David Heffernan Jan 21 '18 at 08:28
7

Assuming you want to retain the same encoding, you can do this:

SetLength(bytes, Length(ansiStr));
Move(Pointer(ansiStr)^, Pointer(bytes)^, Length(ansiStr));

In reverse it goes:

SetLength(ansiStr, Length(bytes));
Move(Pointer(bytes)^, Pointer(ansiStr)^, Length(bytes));
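
For reuse, the two snippets above can be wrapped in a pair of functions. This is only a minimal sketch: the names `AnsiToBytes` and `BytesToAnsi` are hypothetical, and it assumes a desktop Delphi compiler (XE2 or later) where `TBytes` comes from `System.SysUtils`. The `Length > 0` checks are purely defensive, since `Move` with a count of 0 copies nothing.

```pascal
program AnsiBytesRoundTrip;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // declares TBytes

// Hypothetical helper: copies the raw bytes of S, with no transcoding.
function AnsiToBytes(const S: AnsiString): TBytes;
begin
  SetLength(Result, Length(S));
  if Length(S) > 0 then
    Move(Pointer(S)^, Pointer(Result)^, Length(S));
end;

// Hypothetical helper: the reverse copy. The resulting string carries the
// default ANSI code page; the bytes themselves are copied unchanged.
function BytesToAnsi(const B: TBytes): AnsiString;
begin
  SetLength(Result, Length(B));
  if Length(B) > 0 then
    Move(Pointer(B)^, Pointer(Result)^, Length(B));
end;

var
  A: AnsiString;
  B: TBytes;
begin
  A := 'Test';
  B := AnsiToBytes(A);
  Assert(Length(B) = 4);                // one byte per AnsiChar
  Assert(BytesToAnsi(B) = A);           // round trip preserves the content
  Assert(Length(AnsiToBytes('')) = 0);  // empty input gives an empty array
  WriteLn('Round trip OK');
end.
```

Both directions still allocate and copy, which is the cost discussed in the comments below.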
David Heffernan
  • Thanks David, sad that there is no other way than copying the memory :( – zeus Jan 19 '18 at 08:13
  • It's the requirement to have it as a TBytes that forces the memory copy. Rethink that requirement to avoid the copy. – David Heffernan Jan 19 '18 at 08:18
  • Anyway, why are you even working with ANSI encoded strings? Really no place for that. – David Heffernan Jan 19 '18 at 08:23
  • Because AnsiStrings are much more powerful than UnicodeStrings. On a webserver, every input and output is an 8-bit string (UTF-8); it's stupid to convert everything to 16-bit strings in the middle (and it's error-prone if the input was badly UTF-8 encoded). – zeus Jan 19 '18 at 08:41
  • So UTF8String is "more powerful" ("powerful" in what way, actually?), but certainly not AnsiString (codepages are mainly a Windows concept). And not everyone writes webservers. – Rudy Velthuis Jan 19 '18 at 09:00
  • Pretty sure that you don't want to be converting from UTF8 to ANSI ....... – David Heffernan Jan 19 '18 at 10:01
  • @RudyVelthuis Not everyone writes webservers. No, but some people do. And for them performance matters. Just because you don't use Delphi in a particular way doesn't mean that it's not important to somebody. This is a common refrain of yours where you have a tendency to dismiss criticism if it pertains to a weakness that doesn't affect you. – David Heffernan Jan 19 '18 at 10:31
  • Still, even for a webserver, AnsiString is very likely not the most useful type. UTF8String could be, probably, but not AnsiString. Anyway, I don't dismiss criticism, but I know that loki wants Embarcadero to dump UnicodeString entirely and to replace it with AnsiString (or UTF8String) again. That is what I meant. I don't think that UTF8String or AnsiString are "more powerful", especially since most APIs and platforms use UTF-16 by default. – Rudy Velthuis Jan 19 '18 at 10:59
  • @RudyVelthuis No, `AnsiString` isn't very useful these days. Some first class support for UTF-8 encoded strings would be nice. As you say, most platform APIs use UTF-16. Linux is the obvious exception. – David Heffernan Jan 19 '18 at 11:11
  • @David: I had actually expected that they would make UTF8String the main string type on Linux. But unfortunately, that was not the case. I guess they decided it would have been too much work and too many $IFDEFs in the runtime library. – Rudy Velthuis Jan 19 '18 at 11:45
  • @loki: in certain circumstances, you may not have to copy at all. It very much depends on what you do with the TBytes. In some circumstances, you could probably just cast (directly, or, if necessary, via a cast to Pointer, can't test that right now) to TBytes. But in that case, you must be sure when and how and you must absolutely know what you are doing. It would be quite a hack, but one that could save time. It all depends on what you want to do with the TBytes. – Rudy Velthuis Jan 19 '18 at 11:51
  • @Rudy: AnsiString, UTF8String, etc., it doesn't matter, it's all the same: an 8-bit string! And I work only with UTF-8 data inside AnsiString rather than UTF8String, because 99% of the small 8-bit functions are written with AnsiString parameters and not UTF8String (like the latest AnsiString PosEx that you made). Yes, I would like to find a way to cast an AnsiString to a TBytes, as in the end it contains the same data, but it looks like we can't :( – zeus Jan 19 '18 at 12:59
  • Actually, no, it's not the same. The encoding is different. The copy can surely be avoided. But not while you use both AnsiString and TBytes. – David Heffernan Jan 19 '18 at 13:02
  • @Rudy: Also, I don't want Embarcadero to dump UnicodeString; I want Embarcadero to continue to fully support AnsiString (it's crazy that there is no function like IntToStr for AnsiString). With Linux I was hoping they would reevaluate their position on AnsiString, but as the Linux compiler was done with ARC, I guess they have very (very) few clients and they probably won't do anything more for Linux :( – zeus Jan 19 '18 at 13:04
  • AnsiString makes no sense at all on Linux which has no ANSI encodings. – David Heffernan Jan 19 '18 at 13:07
  • @Loki: You can guess a lot, but I doubt you are right. Linux does not need AnsiString, it only needs UTF8String. Ansi and its codepages are for Windows. And ARC has absolutely nothing to do with that. – Rudy Velthuis Jan 19 '18 at 13:08
  • @David: POSIX has ways to decode/encode/convert Windows codepages, but IME this is a rather slow process, and not often used. I think it requires third party code as well (iconv or ICU, IIRC), although it is included in the runtime for Delphi. Ansi is indeed a Windows idiom. A webserver should certainly not use it. – Rudy Velthuis Jan 19 '18 at 13:14
  • @loki: all these functions accept an UTF8String too, AFAIK. My RVPosEXA certainly does, but so do the others. You *do* know that UTF8String is an AnsiString with codepage 65001 (IIRC), right? – Rudy Velthuis Jan 19 '18 at 13:22
  • @Rudy, yes, I know that UTF8String is an AnsiString with codepage 65001. But it's very important to avoid using two different string types (e.g. UTF8String and AnsiString) even if they have the same codepage, because the compiler doesn't know at compile time that the codepage is the same and will do a transliteration (e.g. MyAnsiStringUTF8 := MyUTF8String will result in UTF-8 => UTF-16 => UTF-8). – zeus Jan 19 '18 at 14:10
  • @David: Yes, I know. When I speak about AnsiString, I mean an 8-bit string. Personally, if we could remove the codepage information from AnsiString, that would be the best way :) – zeus Jan 19 '18 at 14:14
  • @loki: indeed, don't use different string types. Just use UTF8String, instead of an AnsiString (probably with the local codepage) with UTF-8 content. Get rid of your AnsiStringWithUTF8Content. No conversion will take place if your strings have the same codepage. – Rudy Velthuis Jan 19 '18 at 14:15
  • Why don't you just use `TBytes` throughout? – David Heffernan Jan 19 '18 at 14:17
  • @Rudy, I did test, and doing anAnsiString := anUTF8String performs a transliteration to UTF-16 and back to UTF-8 :( So the only way is to have UTF8String absolutely everywhere :( – zeus Jan 19 '18 at 14:17
  • @DavidHeffernan: Because how do you do something like Pos('xxx', myTBytes)? – zeus Jan 19 '18 at 14:19
  • Write a simple function to do so. If you really care about avoiding copying and heap allocation, and transcoding, then that's a simple way out. – David Heffernan Jan 19 '18 at 14:22
  • I come back for one remark: the idea that an AnsiString must not contain bytes is nonsense. Look, for example, at the declaration in the Delphi source code of procedure BinToHex(Buffer: PAnsiChar; Text: PWideChar; BufSize: Integer); ... For Embarcadero, a PAnsiChar can contain bytes but an AnsiString cannot, when PAnsiChar and AnsiString have the same purpose! Totally absurd. Of course an AnsiString can contain bytes; this transliteration just needs to be deactivated! – zeus Jan 19 '18 at 21:57
  • BinToHex should accept PByte. It's wrong that it could accept PAnsiChar. – David Heffernan Jan 19 '18 at 22:05
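
The suggestion in the comments to write a simple function in place of `Pos` on a `TBytes` could look roughly like the following. This is a naive byte-by-byte search, offered only as a sketch: the name `PosBytes`, its parameter order, and the zero-based return value (with -1 for "not found") are all assumptions made for illustration, not an RTL routine.

```pascal
program BytesPosDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils; // declares TBytes

// Hypothetical helper: returns the zero-based index of the first occurrence
// of Pattern in Data at or after StartIndex, or -1 if there is none.
function PosBytes(const Pattern, Data: TBytes; StartIndex: Integer = 0): Integer;
var
  I, J, Last: Integer;
begin
  Result := -1;
  if (Length(Pattern) = 0) or (StartIndex < 0) then
    Exit;
  // Last position at which Pattern still fits inside Data.
  Last := Length(Data) - Length(Pattern);
  for I := StartIndex to Last do
  begin
    J := 0;
    // Relies on short-circuit evaluation (the default, {$B-}).
    while (J < Length(Pattern)) and (Data[I + J] = Pattern[J]) do
      Inc(J);
    if J = Length(Pattern) then
    begin
      Result := I;
      Exit;
    end;
  end;
end;

var
  Data: TBytes;
begin
  Data := TBytes.Create(1, 2, 3, 4, 5);
  WriteLn(PosBytes(TBytes.Create(3, 4), Data)); // 2
  WriteLn(PosBytes(TBytes.Create(9), Data));    // -1
end.
```

The search works directly on the arrays, so it involves no string conversion and no extra allocation beyond the arrays passed in.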