1

I have a .URL file with the following text, which contains a German umlaut character:

[InternetShortcut]
URL=http://edn.embarcadero.com/article/44358
[MyApp]
Notes=Special Test geändert
Icon=default
Title=Bug fix list for RAD Studio XE8

I try to load the text with TMemIniFile:

uses System.IniFiles;
//
procedure TForm1.Button1Click(Sender: TObject);
var
  BookmarkIni: TMemIniFile;
begin
  // The error occurs here:      
  BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url',
                                    TEncoding.UTF8);
  try
    // Some code here
  finally
    BookmarkIni.Free;
  end;
end;

This is the error message text from the debugger:

Project MyApp.exe raised exception class EEncodingError with message 'No mapping for the Unicode character exists in the target multi-byte code page'.

When I remove the word with the German Umlaut character "geändert" from the .URL file then there is NO error.

But that's why I use TMemIniFile, because TIniFile does not work here when the text in the .URL file contains Unicode characters. (There could also be other Unicode characters in the .URL file).

So why do I get an exception here in TMemIniFile.Create?

EDIT: Found the culprit: The .URL file is in ANSI format. The error does not happen when the .URL file is in UTF-8 format. But what can I do when the file is in ANSI format?

EDIT2: I've created a workaround which does work BOTH with ANSI and UTF-8 files:

procedure TForm1.Button1Click(Sender: TObject);
var
  BookmarkIni: TMemIniFile;
  BookmarkIni_: TIniFile;
  ThisFileIsAnsi: Boolean;
begin
  try
    ThisFileIsAnsi := False;
    BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url',
                                    TEncoding.UTF8);
  except
    BookmarkIni_ := TIniFile.Create('F:\Bug fix list for RAD Studio XE8.url');
    ThisFileIsAnsi := True;
  end;
  try
    // Some code here
  finally
    if ThisFileIsAnsi then
      BookmarkIni_.Free
    else
      BookmarkIni.Free;
  end;
end;

What do you think?

user1580348

3 Answers

2

It is not possible, in general, to auto-detect the encoding of a file from its contents.

A clear demonstration of this is given by this article from Raymond Chen: The Notepad file encoding problem, redux. Raymond uses the example of a file containing these two bytes:

D0 AE

Raymond goes on to show that this is a well formed file with the following four encodings: ANSI 1252, UTF-8, UTF-16BE and UTF-16LE.

The take-home lesson here is that you have to know the encoding of your file. Either agree on it by convention with whoever writes the file, or enforce the presence of a BOM.
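To make the ambiguity concrete, here is a minimal sketch that decodes the same two bytes under each of the four encodings and prints the resulting code points. `ShowDecoding` is a hypothetical helper name, and the sketch assumes a Windows system where `TEncoding.GetEncoding(1252)` can create the Windows-1252 encoding.

```delphi
program DecodeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: decodes Bytes with Enc and prints each UTF-16 code unit.
procedure ShowDecoding(const Name: string; Enc: TEncoding; const Bytes: TBytes);
var
  S: string;
  I: Integer;
begin
  S := Enc.GetString(Bytes);
  Write(Name, ': ');
  for I := 1 to Length(S) do
    Write('U+', IntToHex(Ord(S[I]), 4), ' ');
  Writeln;
end;

var
  Bytes: TBytes;
  Cp1252: TEncoding;
begin
  Bytes := TBytes.Create($D0, $AE);
  // GetEncoding returns a new instance that the caller must free.
  Cp1252 := TEncoding.GetEncoding(1252);
  try
    ShowDecoding('Windows-1252', Cp1252, Bytes);             // U+00D0 U+00AE
    ShowDecoding('UTF-8', TEncoding.UTF8, Bytes);            // U+042E
    ShowDecoding('UTF-16LE', TEncoding.Unicode, Bytes);      // U+AED0
    ShowDecoding('UTF-16BE', TEncoding.BigEndianUnicode, Bytes); // U+D0AE
  finally
    Cp1252.Free;
  end;
end.
```

None of the four calls fails, which is the point: the bytes alone do not say which reading was intended.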

David Heffernan
1

You need to decide what the encoding of the file is, once and for all. There's no foolproof way to auto-detect this, so you'll have to enforce it in the code that creates these files.

If the creation of this file is outside your control, then you are more or less out of luck. You can try to rely on the BOM (byte order mark) at the beginning of the file (which should be there if it is a UTF-8 file). I can't see from the documentation of TMemIniFile what the Create constructor without an encoding parameter assumes about the encoding of the file (my guess is that it follows the BOM and, if there is none, assumes ANSI, i.e. the system code page).

One thing you can do - if you decide to stick to your current method - is to change your code to:

procedure TForm1.Button1Click(Sender: TObject);
var
  BookmarkIni: TCustomIniFile;
begin
  // The error occurs here:
  try
    BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url',
                                    TEncoding.UTF8);
  except
    BookmarkIni := TIniFile.Create('F:\Bug fix list for RAD Studio XE8.url');
  end;
  try
    // Some code here
  finally
    BookmarkIni.Free;
  end;
end;

You don't need two separate variables, as TIniFile and TMemIniFile (as well as TRegistryIniFile) all have a common ancestor: TCustomIniFile. By declaring your variable as this common ancestor, you can instantiate (create) it as any of the class types that inherit from TCustomIniFile. The actual (run-time) type is determined by which constructor you call to create it.

But first, you should try to use

BookmarkIni := TMemIniFile.Create('F:\Bug fix list for RAD Studio XE8.url');

i.e. without any encoding specified, and see if it works with both ANSI and UTF-8 files.
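If you would rather make the BOM check explicit than rely on whatever the constructor does internally, a sketch along the following lines may help. It assumes `TEncoding.GetBufferEncoding` with a default-encoding argument is available in your Delphi version and that `TEncoding.Default` is the system ANSI code page (as on Windows); `OpenIniByBom` is a hypothetical helper name, and the file name is taken from the first command-line parameter.

```delphi
program BomProbe;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils, System.IniFiles;

// Hypothetical helper: picks the encoding from the BOM and falls back to
// TEncoding.Default (assumed to be the ANSI code page) when there is none.
function OpenIniByBom(const FileName: string): TMemIniFile;
var
  Bytes: TBytes;
  Enc: TEncoding;
  BomSize: Integer;
begin
  Bytes := TFile.ReadAllBytes(FileName);
  Enc := nil; // must be nil so that GetBufferEncoding performs the detection
  BomSize := TEncoding.GetBufferEncoding(Bytes, Enc, TEncoding.Default);
  if BomSize > 0 then
    Writeln('BOM found, ', BomSize, ' bytes')
  else
    Writeln('No BOM, assuming the default (ANSI) encoding');
  Result := TMemIniFile.Create(FileName, Enc);
end;

var
  Ini: TMemIniFile;
begin
  Ini := OpenIniByBom(ParamStr(1));
  try
    Writeln(Ini.ReadString('MyApp', 'Notes', ''));
  finally
    Ini.Free;
  end;
end.
```

Note that this still reads a UTF-8 file without a BOM as ANSI; it only makes the decision visible, it does not make it smarter.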

EDIT: Here's a test program to verify my claim made in the comments:

program Project21;

{$APPTYPE CONSOLE}

uses
  IniFiles, System.SysUtils;

const
  FileName = 'F:\Bug fix list for RAD Studio XE8.url';

var
  TXT : TextFile;

procedure Test;
var
  BookmarkIni: TCustomIniFile;
begin
  try
    BookmarkIni := TMemIniFile.Create(FileName,TEncoding.UTF8);
  except
    BookmarkIni := TIniFile.Create(FileName);
  end;
  try
    Writeln(BookmarkIni.ReadString('MyApp','Notes','xxx'))
  finally
    BookmarkIni.Free;
  end;
end;

begin
  try
    AssignFile(TXT,FileName); REWRITE(TXT);
    try
      WRITELN(TXT,'[InternetShortcut]');
      WRITELN(TXT,'URL=http://edn.embarcadero.com/article/44358');
      WRITELN(TXT,'[MyApp]');
      WRITELN(TXT,'Notes=The German a umlaut consists of the following two ANSI characters: '#$C3#$A4);
      WRITELN(TXT,'Icon=default');
      WRITELN(TXT,'Title=Bug fix list for RAD Studio XE8');
    finally
      CloseFile(TXT)
    end;
    Test;
    ReadLn
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
HeartWare
  • `TMemIniFile.Create` without the Encoding parameter seems to work with both ANSI and UTF-8 files (at least there is no more exception thrown). Now I have to make some more tests to see whether the Read and Write methods give correct results with Unicode characters in this case. – user1580348 Jan 01 '16 at 21:56
  • When not specifying the encoding with `TMemIniFile.Create`, `BookmarkIni.ReadString` gives correct results with both `ANSI` and `UTF-8-BOM` files. `UTF-8` files without `BOM` give strings with "funny characters" for Unicode characters. However, there seems to be no way to detect `UTF-8` files without `BOM`. – user1580348 Jan 01 '16 at 22:33
  • HEUREKA!!!!! My above workaround, which uses `TIniFile` in the `except` section (and which David called "terrible"), gives back correct results with `BookmarkIni.ReadString` in ALL 3 cases!!! ANSI, UTF-8-BOM and UTF-8 without BOM!!! So this seems to be THE solution!! – user1580348 Jan 01 '16 at 22:45
  • Just to clarify: There are cases where your workaround won't work correctly, as there's no foolproof way to determine if a text string that is valid in UTF-8 encoding is supposed to be interpreted as UTF-8 or if the two bytes (or more) that comprise the Unicode character are meant to be understood as ANSI "funny characters". Using heuristics you can make a qualified guess as to what it is, but it's not foolproof. You should stick with the "no-encoding-specified" TMemIniFile and insist that the files you are processing are either ANSI or have a proper UTF-8 BOM. – HeartWare Jan 01 '16 at 22:52
  • Can you tell me in which cases my workaround will not work? So far it has worked correctly with all 3 file encoding formats. – user1580348 Jan 01 '16 at 22:55
  • Like I said: Take a string that is valid UTF-8 encoding. This is ALSO a valid ANSI encoding, albeit with "funny looking" characters. How do you determine if the "funny looking" characters are supposed to _be_ those "funny looking" characters, or if they are supposed to be interpreted as a Unicode compound character? You can't, unless you know it beforehand, e.g. from a BOM. – HeartWare Jan 01 '16 at 22:58
  • You didn't tell me in which cases my workaround will not work correctly, as you stated above. – user1580348 Jan 01 '16 at 23:08
  • Yes, I did... How is an umlaut encoded in UTF-8? Replace your "Special test" line with "The German a umlaut consists of two ANSI characters ä". Then save the file without BOM and try to read it with your workaround. Will you get a proper sentence read out, or will it say "The German a umlaut consists of two ANSI characters ä"? – HeartWare Jan 01 '16 at 23:15
  • The German Umlaut character "ä" in UTF-8 consists of these two single bytes: "ä" (%C3%A4). So I saved this text as UTF-8 without BOM. My workaround did NOT read them as single two-byte UTF-8 character "ä" as you would have expected but as the two same single bytes "ä"! So my workaround also works in these cases! I have also tried it with other unusual Unicode characters and it works there too! – user1580348 Jan 01 '16 at 23:51
  • FYI, you don't have to resort to `TIniFile` to read a non-UTF8 file. `TMemIniFile` uses `TEncoding.Default` (Ansi) if you do not specify an encoding and no BOM is present. See [this discussion](http://codeverge.com/embarcadero.delphi.general/tmeminifile-detection-of-unicode/2026695) for details. – Remy Lebeau Jan 02 '16 at 00:59
  • @user1580348: That's odd - when I do it, it writes a single ä character. See my edited answer for a full program (I can't publish the code in a comment). – HeartWare Jan 02 '16 at 08:21
  • @user1580348 You are getting very confused here. You cannot auto detect encoding from content. My answer states and demonstrates that very clearly. – David Heffernan Jan 02 '16 at 10:53
0

The rule of thumb: to read data (a file, a stream, whatever) correctly, you must know the encoding! The best solution is to let the user choose the encoding or to enforce one, e.g. UTF-8.

Moreover, knowing only that a file is "ANSI" does not make things easier unless you also know the code page.

A must read - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Another approach is to try to detect the encoding (like browsers do with sites when no encoding is specified). Detecting UTF encodings is relatively easy if a BOM exists, but the BOM is often omitted. Take a look at Mozilla's universalchardet or chsdet.
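As an illustration of the simplest content-based heuristic (not how universalchardet or chsdet work internally), the sketch below checks whether a byte buffer is structurally valid UTF-8. `LooksLikeUtf8` is a hypothetical helper; it is deliberately simplified and does not reject overlong forms or surrogate code points.

```delphi
program Utf8Guess;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True if every lead byte is followed by the right
// number of continuation bytes (10xxxxxx). Simplified structural check only.
function LooksLikeUtf8(const Bytes: TBytes): Boolean;
var
  I, Need: Integer;
  B: Byte;
begin
  I := 0;
  while I < Length(Bytes) do
  begin
    B := Bytes[I];
    if B < $80 then Need := 0                      // plain ASCII
    else if (B and $E0) = $C0 then Need := 1       // 2-byte sequence
    else if (B and $F0) = $E0 then Need := 2       // 3-byte sequence
    else if (B and $F8) = $F0 then Need := 3       // 4-byte sequence
    else Exit(False);                              // stray continuation or invalid lead
    Inc(I);
    while Need > 0 do
    begin
      if (I >= Length(Bytes)) or ((Bytes[I] and $C0) <> $80) then
        Exit(False);
      Inc(I);
      Dec(Need);
    end;
  end;
  Result := True;
end;

begin
  // "geändert" saved as Windows-1252: $E4 is not followed by continuation bytes.
  Writeln(LooksLikeUtf8(TBytes.Create($67, $65, $E4, $6E, $64, $65, $72, $74))); // FALSE
  // "geändert" saved as UTF-8 without BOM.
  Writeln(LooksLikeUtf8(TBytes.Create($67, $65, $C3, $A4, $6E, $64, $65, $72, $74))); // TRUE
  // The two Windows-1252 characters "Ã¤": same bytes as a UTF-8 "ä".
  Writeln(LooksLikeUtf8(TBytes.Create($C3, $A4))); // TRUE, although the text was ANSI
end.
```

The last case is why this remains a guess: ANSI text that happens to form valid UTF-8 sequences is classified as UTF-8, so a heuristic like this can reduce errors but cannot replace knowing the encoding.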

kwarunek