12

When I run the code below, the output in XE2 differs from the output in D2009.

procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
    myByte: Byte;

begin
  assignfile(Outfile,'test_chinese.txt');
  Rewrite(Outfile);

  for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
  //This is the UTF-8 BOM

  Writeln(Outfile,utf8string('总结'));
  Writeln(Outfile,'°C');
  Closefile(Outfile);
end;

Compiling with XE2 on a Windows 8 PC gives in WordPad

?? C

txt hex code: EF BB BF 3F 3F 0D 0A B0 43 0D 0A

Compiling with D2009 on a Windows XP PC gives in WordPad

总结 °C

txt hex code: EF BB BF E6 80 BB E7 BB 93 0D 0A B0 43 0D 0A

My question is why the output differs, and how I can save Chinese characters to a text file using the old text file I/O.

Thanks!

Thomas
  • 4
    Old text file I/O officially does not support Unicode. Don't rely on it. If you do, be aware that the implementation is buggy, and the bugs vary by Delphi version. – Jeroen Wiert Pluimers Jan 09 '13 at 11:06
  • 1
    Isn't this a job for `TStreamWriter`? – David Heffernan Jan 09 '13 at 12:05
  • Actually, in XE2 at least, the old-style file I/O does have some support for Unicode. `AssignFile()` has an optional `CodePage` parameter, and `Write/ln()` has overloads that accept `UnicodeString` and `WideChar` inputs. – Remy Lebeau Jan 09 '13 at 18:23

3 Answers

21

In XE2 onwards, AssignFile() has an optional CodePage parameter that sets the codepage of the output file:

function AssignFile(var F: File; FileName: String; [CodePage: Word]): Integer; overload;

Write() and Writeln() both have overloads that support UnicodeString and WideChar inputs.

So, you can create a file that has its codepage set to CP_UTF8, and then Write/ln() will automatically convert Unicode strings to UTF-8 when writing them to the file.

The downside is that you can no longer write the UTF-8 BOM using AnsiChar values, because the individual bytes get converted to UTF-8 and are therefore not written correctly. You can get around that by writing the BOM as a single Unicode character (which is what it really is: U+FEFF) instead of as individual bytes.

This works in XE2:

procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TextFile;
begin
  AssignFile(Outfile, 'test_chinese.txt', CP_UTF8);
  Rewrite(Outfile);

  //This is the UTF-8 BOM
  Write(Outfile, #$FEFF);

  Writeln(Outfile, '总结');
  Writeln(Outfile, '°C');
  CloseFile(Outfile);
end;

With that said, if you want something that is more compatible and reliable between D2009 and XE2, use TStreamWriter instead:

procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TStreamWriter;
begin
  Outfile := TStreamWriter.Create('test_chinese.txt', False, TEncoding.UTF8);
  try
    Outfile.WriteLine('总结');
    Outfile.WriteLine('°C');
  finally
    Outfile.Free;
  end;
end;

Or do the file I/O manually:

procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TFileStream;
  BOM: TBytes;

  procedure WriteBytes(const B: TBytes);
  begin
    if Length(B) > 0 then Outfile.WriteBuffer(B[0], Length(B));
  end;

  procedure WriteStr(const S: UTF8String);
  begin
    if S <> '' then Outfile.WriteBuffer(S[1], Length(S));
  end;

  procedure WriteLine(const S: UTF8String);
  begin
    WriteStr(S);
    WriteStr(sLineBreak);
  end;

begin
  Outfile := TFileStream.Create('test_chinese.txt', fmCreate);
  try
    WriteBytes(TEncoding.UTF8.GetPreamble);
    WriteLine('总结');
    WriteLine('°C');
  finally
    Outfile.Free;
  end;
end;
Remy Lebeau
6

You really shouldn't use the old text I/O anymore.

Anyway, you can use TEncoding to get the UTF-8 TBytes like this:

procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
    Bytes: TBytes;
    myByte: Byte;
begin
  assignfile(Outfile,'test_chinese.txt');
  Rewrite(Outfile);

  for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
  //This is the UTF-8 BOM

  Bytes := TEncoding.UTF8.GetBytes('总结');
  for myByte in Bytes do begin
    Write(Outfile, AnsiChar(myByte));
  end;

  Writeln(Outfile,'°C');
  Closefile(Outfile);
end;

I'm not sure if there is an easier way to write TBytes to a TextFile; maybe somebody else has a better idea.

Edit:

For a pure binary file (File instead of TextFile type) you can use BlockWrite.
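
As a rough illustration of the BlockWrite route, the sketch below writes the BOM and both lines to an untyped file. It is only one way to do it, under some assumptions: the `WriteBytes` helper and the program name are made up for this example, the source file is saved as UTF-8 so the compiler reads the literals correctly, and the record size is set to 1 so that counts are in bytes.

```pascal
program BlockWriteUtf8Demo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of the RTL): writes a byte array
// to an untyped file that was opened with a record size of 1.
procedure WriteBytes(var F: file; const B: TBytes);
begin
  // Guard against empty arrays, where B[0] would be out of range.
  if Length(B) > 0 then
    BlockWrite(F, B[0], Length(B));
end;

var
  OutFile: file;
begin
  AssignFile(OutFile, 'test_chinese.txt');
  Rewrite(OutFile, 1); // record size 1, so BlockWrite counts bytes
  try
    // The UTF-8 BOM, followed by each line encoded explicitly as UTF-8.
    WriteBytes(OutFile, TEncoding.UTF8.GetPreamble);
    WriteBytes(OutFile, TEncoding.UTF8.GetBytes('总结' + sLineBreak));
    WriteBytes(OutFile, TEncoding.UTF8.GetBytes('°C' + sLineBreak));
  finally
    CloseFile(OutFile);
  end;
end.
```

Because every string goes through `TEncoding.UTF8.GetBytes`, no implicit code-page conversion is involved, so '°C' is also written as UTF-8 rather than as an ANSI byte.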

Jens Mühlenhoff
  • TFile.WriteAllBytes(const Path: string; const Bytes: TBytes) from System.IOUtils can write TBytes to a file. – Giel Jan 09 '13 at 12:13
  • @Giel That won't write the BOM. And it's not convenient if you want to write the content in bits and pieces. – David Heffernan Jan 09 '13 at 12:41
  • Thank you Jens. The reason that I use the old text I/O is that the D2009 project has many lines of code and I just want to make a quick and dirty solution using XE2. – Thomas Jan 10 '13 at 09:18
5

There are a couple of tell-tale signs that show what went wrong when dealing with Unicode. In your case you're seeing "?" in the resulting output file: you get question marks when you try to convert something from Unicode to a code page and the target code page can't represent the requested characters.

Looking at the hex dump it's obvious (counting line terminators) that the question marks are the result of saving the two Chinese characters to the file. The two characters got converted to exactly two question marks (3F 3F). This tells you that Writeln() decided to give you a helping hand and converted the text from UTF-8 (a Unicode representation) to your local code page. The Delphi team probably decided to do this because the old I/O routines are not supposed to be Unicode compatible; since you're writing a UTF-8 string using the old I/O routines, they're helping you by converting it to your code page. You might not welcome that helping hand, but that doesn't mean it was wrong to offer it: this is undocumented territory.
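
To see this kind of lossy conversion in isolation, a small console sketch like the following can help. It assumes a Windows target and picks code page 1252 purely as an example of a code page that has no Chinese characters; the program name is made up, and the source file is assumed to be saved as UTF-8. It encodes the two characters to that code page and dumps the resulting bytes, so you can compare them with the 3F 3F in the hex dump from the question.

```pascal
program LossyConversionDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Enc: TEncoding;
  B: Byte;
begin
  // Code page 1252 is only an example of a code page that cannot
  // represent the Chinese characters used in the question.
  Enc := TEncoding.GetEncoding(1252);
  try
    // Convert from Unicode to the code page and print each byte in hex.
    for B in Enc.GetBytes('总结') do
      Write(IntToHex(B, 2), ' ');
    Writeln;
  finally
    // Instances returned by GetEncoding are owned by the caller.
    Enc.Free;
  end;
end.
```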

Since you now know why that's happening, you know what to do to stop it: let Writeln() know you're sending something that doesn't need converting. You'll discover that's not particularly easy, since Delphi XE2 apparently "helps you out" whatever you do. For example, a cast like this doesn't just change the string type; it converts to AnsiString, going through the code-page conversion routine that gets you question marks:

AnsiString(UTF8String('Whatever Unicode'));

Because of this, and if you need one-liner solutions, you could try a conversion routine, something like this:

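function FakeConvert(const InStr: UTF8String): AnsiString;
var N: Integer;
begin
  N := Length(InStr);
  SetLength(Result, N);
  // Copy the raw UTF-8 bytes without any code-page conversion.
  // Guard against the empty string, where InStr[1] is not valid.
  if N > 0 then Move(InStr[1], Result[1], N);
end;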

You'll then be able to do:

Writeln(Outfile,FakeConvert('总结'));

And it'll do what you expect (I did actually try it before posting!)

Of course the only TRUE answer to this question is, since you upgraded all the way to Delphi XE2:

Stop using the deprecated I/O routines and move to TStream-based I/O.

Community
Cosmin Prund
  • And thanks to you Cosmin for this solution and explanation, too! – Thomas Jan 09 '13 at 15:03
  • 2
    There is an easier solution. In XE2, at least, `TextFile` and `Writeln()` actually **do** support Unicode. See my answer for an example. – Remy Lebeau Jan 09 '13 at 18:22
  • Delphi XE2 (up to and including the latest, XE5) made a big mistake in breaking good old writeln. writeln is very useful and fast; my test case shows that TStreamWriter is extremely slow. And when you write console or even CGI apps, using TStreamWriter is out of the question. – mandel Jan 14 '14 at 17:13