2

I have encountered a problem saving a string that contains German letters to a text file. The MCVE looks like this:

procedure TForm1.Button1Click(Sender: TObject);
var
  s: string; //alias for UnicodeString 
  tf: textfile;
  ms: tmemorystream;
begin
  s := 'ßüÜöÖäÄФфшШ';
  assignfile(tf, 'b:\tmp.txt');
  Rewrite(tf);
  write(tf, s);
  closefile(tf);
  ms := tmemorystream.Create;
  try
    ms.WriteBuffer(Pointer(s)^, Length(s) * SizeOf(s[Low(s)]));
    ms.Position := 0;
    ms.SaveToFile('b:\tmp2.txt');
  finally
    ms.Free;
  end;
end;

If the string is written directly to the file with Write, tmp.txt contains: ?uUoOaAФфшШ. The German letters are changed, although the Cyrillic letters remain intact. If the string is saved via TMemoryStream, the result is correct; tmp2.txt contains: ßüÜöÖäÄФфшШ. What is the reason for this?

Appended

I decided to add the HEX values for the given string saved in different ways:

For Write method:

data: array[0..10] of byte = (
    $3F, $75, $55, $6F, $4F, $61, $41, $D4, $F4, $F8, $D8
);

For Write method called after AssignFile(tf, 'b:\tmp.txt',CP_UTF8);:

data: array[0..21] of byte = (
    $C3, $9F, $C3, $BC, $C3, $9C, $C3, $B6, $C3, $96, $C3, $A4, $C3, $84, $D0, $A4, 
    $D1, $84, $D1, $88, $D0, $A8
);

For TMemoryStream:

data: array[0..21] of byte = (
    $DF, $00, $FC, $00, $DC, $00, $F6, $00, $D6, $00, $E4, $00, $C4, $00, $24, $04, 
    $44, $04, $48, $04, $28, $04
);

For TStringList:

data: array[0..27] of byte = (
    $FF, $FE, $DF, $00, $FC, $00, $DC, $00, $F6, $00, $D6, $00, $E4, $00, $C4, $00, 
    $24, $04, $44, $04, $48, $04, $28, $04, $0D, $00, $0A, $00
);

Appended

Following the valued advice of @Remy-Lebeau, I also tried his method (TStreamWriter). It generates a file 25 bytes long. The bytes match those produced by Write after calling AssignFile(tf, 'b:\tmp.txt',CP_UTF8);, with 3 additional bytes at the start (BOM?).

data: array[0..24] of byte = (
    $EF, $BB, $BF, $C3, $9F, $C3, $BC, $C3, $9C, $C3, $B6, $C3, $96, $C3, $A4, $C3, 
    $84, $D0, $A4, $D1, $84, $D1, $88, $D0, $A8
);
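A minimal sketch of how byte listings like the ones above could be produced, assuming a Delphi version that ships System.IOUtils (XE5 does). `BytesToHex` is a hypothetical helper name, not an RTL routine, and the path is simply the one used in the question.

```pascal
program DumpBytes;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper: formats a byte array as '$3F, $75, ...'.
function BytesToHex(const Bytes: TBytes): string;
var
  i: Integer;
begin
  Result := '';
  for i := 0 to High(Bytes) do
  begin
    if i > 0 then
      Result := Result + ', ';
    Result := Result + '$' + IntToHex(Bytes[i], 2);
  end;
end;

begin
  // Reads the raw bytes, so no text decoding can hide what is on disk.
  Writeln(BytesToHex(TFile.ReadAllBytes('b:\tmp.txt')));
end.
```

Inspecting the raw bytes this way avoids relying on a text editor, which may silently guess the encoding.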
asd-tm
  • See [unicode text file output differs between XE2 and Delphi 2009?](http://stackoverflow.com/a/14243866/576719). – LU RD Aug 30 '15 at 08:33
  • Thank you, @LU-RD. I have Delphi XE5. Your comment helped me. The code is working properly with the following line: `assignfile(tf, 'b:\tmp.txt',CP_UTF8);` Please, post an answer, so that I could accept it. – asd-tm Aug 30 '15 at 08:38
  • I guess what you need to decide is how you wish to encode your text. You should take that decision actively. – David Heffernan Aug 30 '15 at 21:56
  • Thank you, @DavidHeffernan. Did I understand you correctly that writing a string variable to a file requires an explicit indication of the encoding? But does that mean that writing a `File of somerecord` containing `records` with `strings` (of fixed length) will lead to data corruption? – asd-tm Aug 31 '15 at 18:31
  • Simple question. What encoding do you want to use for your text. – David Heffernan Aug 31 '15 at 18:50
  • @DavidHeffernan I have appended the question with HEX sequences for the different ways of saving the string. It might be interesting for the community. I think that I'll have to save in UTF-8. – asd-tm Aug 31 '15 at 18:59
  • Good to see you are no longer observing results using Notepad! I guess now you can see why you get extra bytes with the `TStringList` method. – Free Consulting Aug 31 '15 at 20:47
  • Yes, BOM encoded as UTF-8 is exactly these 3 bytes. – Free Consulting Aug 31 '15 at 20:50
  • Thank you, @FreeConsulting. It is. But what is the encoding then for TStringList by default? – asd-tm Aug 31 '15 at 20:55
  • Note that string of fixed length in Delphi is `ShortString` and it is single byte. Saving such `record` to file with UTF-8 encoding would require you use explicit `Utf8Encode` call. – Free Consulting Aug 31 '15 at 20:55
  • @FreeConsulting, thank you for your time once again. I'll note your kind advice. – asd-tm Aug 31 '15 at 21:01
  • @asd-tm, there is a pretty long chain of defaults, and according to the [documentation](http://docwiki.embarcadero.com/Libraries/XE8/en/System.Classes.TStrings.DefaultEncoding) Delphi on Windows will reduce your data to ANSI, which doesn't suit your needs. – Free Consulting Aug 31 '15 at 21:02

3 Answers

3

To store Unicode strings in text files with the Write/WriteLn procedures, you must specify a suitable code page when calling AssignFile; otherwise the text is converted to the default ANSI code page:

AssignFile(tf, 'b:\tmp.txt',CP_UTF8);

To make the encoding recognizable regardless of the reader's locale, you can also write a BOM at the start of the file:

Write(tf, #$FEFF);  // U+FEFF; in a UTF-8 text file this is written as the BOM bytes EF BB BF
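To illustrate why the code page matters, here is a small sketch that encodes the question's string in three ways and prints the byte counts. Code page 1251 is an assumption: it is inferred from the Cyrillic bytes in the question's first hex dump, which is 11 bytes long, whereas the UTF-8 and UTF-16 dumps are 22 bytes each. The source file has to be saved in a Unicode encoding for the literal to compile as intended.

```pascal
program CodePageDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  s: string;
  ansi: TEncoding;
begin
  s := 'ßüÜöÖäÄФфшШ';

  // Assumption: 1251 stands in for the asker's default ANSI code page.
  // TEncoding.GetEncoding returns a new instance that the caller must free.
  ansi := TEncoding.GetEncoding(1251);
  try
    // A single-byte code page cannot represent every character here,
    // so some of them are replaced during the conversion.
    Writeln('CP1251: ', Length(ansi.GetBytes(s)), ' bytes');
  finally
    ansi.Free;
  end;

  // Both Unicode encodings keep all 11 characters intact.
  Writeln('UTF-8:  ', Length(TEncoding.UTF8.GetBytes(s)), ' bytes');
  Writeln('UTF-16: ', Length(TEncoding.Unicode.GetBytes(s)), ' bytes');
end.
```

The standard TEncoding.UTF8 and TEncoding.Unicode instances are owned by the RTL and must not be freed, which is why only the code page 1251 instance is released.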
LU RD
  • Thank you. Does this mean that writing method of `TMemoryStream.SaveToFile` differ with that of `Write`? – asd-tm Aug 30 '15 at 09:10
  • Yes, the TMemoryStream stores the file in UTF-16 while this example stores the file in UTF-8. – LU RD Aug 30 '15 at 09:19
  • That's rather bad advice. Since the data unit of UTF-8 fits in one byte, a Byte Order Mark [is redundant with UTF-8](http://stackoverflow.com/a/2223926/205376) encoding and its use is discouraged by the Unicode Consortium. – Free Consulting Aug 30 '15 at 14:37
  • @FreeConsulting, note that I used the word "can". There are circumstances where a BOM is a better choice for UTF-8 files. – LU RD Aug 30 '15 at 15:45
  • @LURD, sure thing. But I do not agree on your rationale to use it. UTF-8 encoded files are already compatible with any locale, regardless of BOM presence. Requirement to include or exclude BOM is merely a matter of support for old and/or broken software implementations. – Free Consulting Aug 30 '15 at 16:18
2

I believe you should always use the functions in the RTL if they can do the job. In this case TStringList does the trick for you in a very simple way:

In this small example I save a string list to a text file and load it back again. Just to prove it works, I've added an Assert test after I've loaded the text file again.

So there is no need to use a memory stream or to be concerned about the BOM. Use a TStringList, because it has all the functionality you need.

procedure TForm1.Button1Click(Sender: TObject);
var
  s: String;
begin
  s := 'ßüÜöÖäÄФфшШ';

  with TStringList.Create do
    try
      Text := s;
      SaveToFile('C:\aa\tmp3.txt', TEncoding.Unicode);
    finally
      free;
    end;

  with TStringList.Create do
    try
      LoadFromFile('C:\aa\tmp3.txt');
      Assert(Strings[0] = s, '');
    finally
      free;
    end;
end;
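As a variation on the example above, here is a hedged sketch that names the encoding explicitly on both the write and the read side, using UTF-8 instead of UTF-16 LE. The relative file name is a placeholder, not a path from the question. Note that TStringList writes the encoding's preamble (BOM) and, by default, a line break after the last line, so the file is larger than the encoded string alone.

```pascal
program StringListUtf8;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

const
  FileName = 'tmp3.txt'; // placeholder path, adjust as needed

var
  s: string;
  sl: TStringList;
begin
  s := 'ßüÜöÖäÄФфшШ';

  sl := TStringList.Create;
  try
    sl.Text := s;
    // Explicit encoding on save: UTF-8 with BOM, followed by a line break.
    sl.SaveToFile(FileName, TEncoding.UTF8);
  finally
    sl.Free;
  end;

  sl := TStringList.Create;
  try
    // Explicit encoding on load, instead of relying on BOM detection.
    sl.LoadFromFile(FileName, TEncoding.UTF8);
    if (sl.Count = 1) and (sl[0] = s) then
      Writeln('Round trip OK')
    else
      Writeln('Round trip FAILED');
  finally
    sl.Free;
  end;
end.
```

Stating the encoding in both calls keeps the file format a deliberate choice rather than a side effect of defaults.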
Jens Borrisholt
  • It's really necessary to mention that your example writes UTF-16 LE (and reads with an unspecified encoding). – Free Consulting Aug 30 '15 at 14:57
  • Thank you, @JensBorrisholt. Sorry, I cannot accept more than one answer; I can only upvote. I've noticed that the sizes of the files differ. If the file was written by `TMemoryStream` or by `Write` after calling `AssignFile(tf, 'b:\tmp.txt',CP_UTF8);` the size was 22 bytes. I thought that an 11-character string would take 22 bytes plus a 2-byte BOM, wouldn't it? After writing with TStringList it was 28 bytes. If I open these files with Notepad they show the same string. Does that mean that if a file is written with `TStringList` it must be read by `TStringList` as well? – asd-tm Aug 30 '15 at 18:27
  • @JensBorrisholt, OK, not "unspecified", but rather "detected using `Preamble`". As you see, failing to **clearly indicate encoding** used, you caused more confusion to an OP. – Free Consulting Aug 30 '15 at 18:44
  • @asd-tm it's OK. The reason for the extra bytes is that I saved it using Unicode and not UTF8. If you want it saved in UTF8, replace TEncoding.Unicode with TEncoding.UTF8 – Jens Borrisholt Aug 30 '15 at 18:50
  • The reason why I do not specify the encoding when loading is that TStringList detects the encoding automatically – Jens Borrisholt Aug 30 '15 at 18:51
  • Oops, I didn't see your question. No, you can write with TStringList and read it with TMemoryStream – Jens Borrisholt Aug 30 '15 at 18:52
  • @asd-tm, UTF-8 BOM is 3 bytes, not 2. – Free Consulting Aug 30 '15 at 18:59
  • @JensBorrisholt, I know the reason, but an OP and any further readers might not. Please clearly indicate what you were doing while writing the resulting file. – Free Consulting Aug 30 '15 at 19:07
  • Sadly, I failed to convince you. Downvoted because of that and the possible bug with extra bytes. – Free Consulting Aug 30 '15 at 20:17
  • @FreeConsulting first of all, the answer about Unicode vs UTF8 wasn't for you, so please be quiet and wait until you are asked. Second, it wasn't a bug, it was just another code page. – Jens Borrisholt Aug 31 '15 at 03:42
  • You could post a better-quality answer for me to be satisfied. Meanwhile, 11 Unicode characters as per the OP, encoded using UTF-16LE + BOM preamble = 24 bytes, not 28. – Free Consulting Aug 31 '15 at 13:20
  • As said before: Nobody called your number. – Jens Borrisholt Aug 31 '15 at 16:13
  • Dear Jens Borrisholt and @FreeConsulting. I have just appended my question by the HEXes for different ways of saving the string. – asd-tm Aug 31 '15 at 19:00
1

In this situation, try using a TStreamWriter, e.g.:

procedure TForm1.Button1Click(Sender: TObject);
var
  s: string; //alias for UnicodeString 
  writer: TStreamWriter;
  ms: TMemoryStream; // used below; the declaration was missing
begin
  s := 'ßüÜöÖäÄФфшШ';
  writer := TStreamWriter.Create('b:\tmp.txt', False, TEncoding.UTF8);
  try
    writer.Write(s);
  finally
    writer.Free;
  end;
  ms := TMemoryStream.Create;
  try
    writer := TStreamWriter.Create(ms, TEncoding.UTF8);
    try
      writer.Write(s);
    finally
      writer.Free;
    end;
    ms.Position := 0;
    ms.SaveToFile('b:\tmp2.txt');
  finally
    ms.Free;
  end;
end;
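TStreamWriter with TEncoding.UTF8 emits the UTF-8 preamble (BOM) at the start of the output. If the BOM is not wanted, one possible approach, sketched here under the assumption that System.IOUtils is available, is to encode the string yourself and write the raw bytes. The file name is a placeholder.

```pascal
program Utf8NoBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

const
  FileName = 'tmp_nobom.txt'; // placeholder path, adjust as needed

var
  s: string;
begin
  s := 'ßüÜöÖäÄФфшШ';

  // GetBytes returns only the encoded characters, without a preamble,
  // so the file contains UTF-8 data and no BOM.
  TFile.WriteAllBytes(FileName, TEncoding.UTF8.GetBytes(s));

  // Report the resulting file size in bytes.
  Writeln(Length(TFile.ReadAllBytes(FileName)), ' bytes written');
end.
```

A reader of such a file has to be told that it is UTF-8, since there is no BOM left to detect the encoding from.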
Remy Lebeau
  • Thank you, Remy Lebeau. I have appended my post with HEX generated by your code. All the answers are very useful and unfortunately I can accept only one. I can only upvote your valued consideration. – asd-tm Aug 31 '15 at 20:58