13

I have discovered (the hard way) that if a file has a valid UTF-8 BOM but contains any invalid UTF-8 sequence, and is read by any of the Delphi (2009+) encoding-enabled methods such as LoadFromFile, then the result is completely empty, with no error indication. In several of my applications I would prefer simply to lose the few bad sequences, even if I get no error report in that case either.

Debugging reveals that MultiByteToWideChar is called twice, first to get the output buffer size, then to do the conversion. But TEncoding.UTF8 contains a private FMBToWCharFlags value for these calls, and this is initialized with a MB_ERR_INVALID_CHARS value. So the call to get the charcount returns 0 and the loaded file is completely empty. Calling this API without the flag would 'silently drop illegal code points'.
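The effect of the flag on the length-probing call can be reproduced outside TEncoding. The following is a minimal console sketch, assuming a Windows target and XE2-style unit names; the program name and the three-byte sample buffer are made up for illustration. It makes only the first of the two calls (the one that asks for the output length), once with MB_ERR_INVALID_CHARS and once with no flags.

```pascal
program Utf8FlagDemo;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

const
  // 'A', a stray continuation byte ($80 is invalid on its own), 'B'.
  // Illustrative data only.
  Bad: array[0..2] of Byte = ($41, $80, $42);

var
  StrictCount, LenientCount: Integer;
begin
  // Length probe with the flag TEncoding.UTF8 uses: the call fails and
  // returns 0 as soon as it meets an invalid sequence.
  StrictCount := MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
    PAnsiChar(@Bad[0]), Length(Bad), nil, 0);

  // Same probe with no flags: invalid input does not make the call fail.
  LenientCount := MultiByteToWideChar(CP_UTF8, 0,
    PAnsiChar(@Bad[0]), Length(Bad), nil, 0);

  Writeln('with MB_ERR_INVALID_CHARS: ', StrictCount);
  Writeln('with flags = 0:            ', LenientCount);
end.
```

The first count should be 0, which is what leaves the loaded file empty. The second count depends on how the Windows version in use treats the bad byte (dropped or replaced), so no specific value is claimed here.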

My question is how best to weave through the nest of classes in the Encoding area to work around the fact that this is a private value (and needs to be, because it is a class var for all threads). I think I could add a custom UTF8 encoding, using the guidance in Marco Cantu's Delphi 2009 book. And it could optionally raise an exception if MultiByteToWideChar has returned an encoding error, after calling it again without the flag. But that does not solve the problem of how to get my custom encoding used instead of TEncoding.UTF8.

If I could just set this up as a default for the application at initialization, perhaps by actually modifying the class var for TEncoding.UTF8, this would probably be sufficient.

Of course, I need a solution without waiting to lodge a QC report asking for a more robust design, getting it accepted, and seeing it changed.

Any ideas would be very welcome. And can someone confirm this is still an issue for XE4, which I have not yet installed?

Ken White
frogb
  • 1
    If you have an answer, please post it as an answer, not as an edit of the question. Otherwise the question will remain open forever with no answers. – Celada May 14 '13 at 01:01

4 Answers

12

I ran into the MB_ERR_INVALID_CHARS issue when I first updated Indy to support TEncoding, and ended up implementing a custom TEncoding-derived class for UTF-8 handling to avoid specifying MB_ERR_INVALID_CHARS. I didn't think to use a class helper.

However, this issue is not just limited to UTF-8. Any decoding failure of any of the TEncoding classes will result in a blank result, not an exception being raised. Why Embarcadero chose that route, when most of the RTL/VCL uses exceptions instead, is beyond me. Not raising an exception on error caused a fair amount of issues in Indy that had to be worked around.

Remy Lebeau
  • 2
    +1 Deriving your own custom TEncoding is clearly what you are supposed to do. – David Heffernan May 14 '13 at 04:14
  • 1
    There are quite a few design and implementation problems with `TEncoding`, so in Indy 10.6 I decided to drop `TEncoding` completely and wrote my own interface-based framework to replace it. – Remy Lebeau May 14 '13 at 08:11
  • @David: how would you then get your encoding used when LoadFromFile detects a BOM? Would you have to read the first three bytes and then pass an encoding parameter for any UTF8 file you find? – frogb May 14 '13 at 09:01
  • @frogb: yes, you would. `TEncoding` does not allow for user-defined classes to be registered into its default BOM handling logic. – Remy Lebeau May 14 '13 at 15:30
  • @remy: thanks. I would have accepted your answer, which is clearly correct for someone maintaining Indy; but my own is more appropriate for me and more closely corresponds to my original question. As often happens, asking the question helps you to find the answer yourself! – frogb May 16 '13 at 17:13
  • @RemyLebeau is it part of Indy or is it available separately? I also wanted to do it, but it eventually got postponed to infinity. I cannot see why they tried to mimic the GC-based .NET classes with a manual-memory-management implementation... Do the occasional `TEncoding.UTF8.Free` and wait for it to explode... – Arioch 'The Jan 16 '17 at 15:58
  • @Arioch'The: the code I wrote is for Indy only. It is the `IIdTextEncoding` interface and support classes/routines in the `IdGlobal` unit. – Remy Lebeau Jan 16 '17 at 19:39
3

This can be done pretty simply, at least in Delphi XE5 (have not checked earlier versions). Just instantiate your own TUTF8Encoding:

procedure LoadInvalidUTF8File(const Filename: string);
var
  FEncoding: TUTF8Encoding;
begin
  FEncoding := TUTF8Encoding.Create(CP_UTF8, 0, 0); 
                      // Instead of CP_UTF8, MB_ERR_INVALID_CHARS, 0
  try
    with TStringList.Create do
    try
      LoadFromFile(Filename, FEncoding);
      // ...
    finally
      Free;
    end;
  finally
    FEncoding.Free;
  end;
end;

The only issue here is that the IsSingleByte property for the newly instantiated TUTF8Encoding is then incorrectly set to False, but this property is not currently used anywhere in the Delphi sources.

Marc Durdin
  • Unfortunately that solution is only useful if you know the file contains invalid characters. Our software only needs to handle Unicode, UTF8 and system default encoding, so the real problem was on loading a file without an encoding parameter. The VCL then 'worked' in all cases except when a file correctly detected as having a UTF8 BOM contained an invalid UTF8 sequence. Such a file ended up loaded as empty. – frogb Jul 30 '14 at 08:49
  • 1
    True -- this solution assumes that you know the encoding to be UTF-8, so it's not suitable if you are trying to sniff the encoding either by BOM or by content. – Marc Durdin Jul 31 '14 at 06:53
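For the narrower case discussed in these comments (default detection everywhere, except that a file with a UTF-8 BOM should be decoded leniently), the BOM check can be done by hand before choosing the encoding. The unit below is a sketch under that assumption: the unit name and the HasUtf8Bom and LoadTolerant helpers are hypothetical, and the three-argument constructor is the one used in the answer above, which was only checked there on XE5.

```pascal
unit TolerantLoad;

interface

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: True if the file starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const FileName: string): Boolean;

// Hypothetical helper: loads with stock detection unless a UTF-8 BOM is
// present, in which case a lenient TUTF8Encoding instance is passed in.
procedure LoadTolerant(Lines: TStrings; const FileName: string);

implementation

uses
  Winapi.Windows;

function HasUtf8Bom(const FileName: string): Boolean;
var
  Stream: TFileStream;
  Head: array[0..2] of Byte;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Files shorter than three bytes cannot carry the BOM.
    Result := (Stream.Read(Head, SizeOf(Head)) = SizeOf(Head)) and
      (Head[0] = $EF) and (Head[1] = $BB) and (Head[2] = $BF);
  finally
    Stream.Free;
  end;
end;

procedure LoadTolerant(Lines: TStrings; const FileName: string);
var
  Lenient: TEncoding;
begin
  if not HasUtf8Bom(FileName) then
  begin
    // No UTF-8 BOM: leave everything to the built-in detection.
    Lines.LoadFromFile(FileName);
    Exit;
  end;

  // UTF-8 BOM found: decode without MB_ERR_INVALID_CHARS.
  Lenient := TUTF8Encoding.Create(CP_UTF8, 0, 0);
  try
    Lines.LoadFromFile(FileName, Lenient);
  finally
    Lenient.Free;
  end;
end;

end.
```

This only covers code paths that call LoadTolerant explicitly; anything that calls LoadFromFile without an encoding still goes through the stock TEncoding.UTF8 instance.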
1

A partial workaround is to force the UTF8 encoding to suppress MB_ERR_INVALID_CHARS globally. For me, this avoids the need for raising an exception, because I find it makes MultiByteToWideChar not quite 'silent': it actually inserts $fffd characters (Unicode 'replacement character') which I can then find in the cases where this is important. The following code does this:

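unit fixutf8;

interface

uses System.SysUtils;

type
  // Class helper used only to reach the private FMBToWCharFlags field.
  TUTF8fixer = class helper for TMBCSEncoding
  public
    procedure setflag0;
  end;

implementation

procedure TUTF8fixer.setflag0;
{$if CompilerVersion = 31}
// Delphi 10.1 Berlin: the helper can no longer assign the private field
// from Pascal code, so it is cleared via inline assembler
// (written for the 32-bit x86 compiler).
asm
  XOR ECX,ECX
  MOV Self.FMBToWCharFlags,ECX
end;
{$else}
// Earlier compilers: the helper can assign the private field directly.
begin
  Self.FMBToWCharFlags := 0;
end;
{$ifend} // $IFEND also closes $IF on compilers older than XE4

procedure initencoding;
begin
  // Clears MB_ERR_INVALID_CHARS on the shared TEncoding.UTF8 instance.
  (TEncoding.UTF8 as TMBCSEncoding).setflag0;
end;

initialization
  initencoding;
end.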

A more useful and principled fix would change the conversion calls to MultiByteToWideChar so that they do not pass MB_ERR_INVALID_CHARS, and would make one initial call with the flag so that an exception could be raised after the load completes, indicating that characters have been replaced.
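The probe-then-convert idea can be sketched outside the RTL. The unit below is only an illustration, not how TEncoding is implemented: the unit name and DecodeUtf8Reporting are hypothetical, the function works on a byte buffer that the caller has already read, and it does not strip a BOM.

```pascal
unit Utf8Report;

interface

uses
  System.SysUtils;

// Hypothetical helper: decodes UTF-8 bytes without MB_ERR_INVALID_CHARS and
// reports through HadErrors whether a strict decode would have rejected them.
function DecodeUtf8Reporting(const Bytes: TBytes; out HadErrors: Boolean): string;

implementation

uses
  Winapi.Windows;

function DecodeUtf8Reporting(const Bytes: TBytes; out HadErrors: Boolean): string;
var
  Len: Integer;
begin
  Result := '';
  HadErrors := False;
  if Length(Bytes) = 0 then
    Exit;

  // Probe with the strict flag: a result of 0 for non-empty input means
  // the call failed, here taken to mean an invalid sequence was found.
  HadErrors := MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
    PAnsiChar(@Bytes[0]), Length(Bytes), nil, 0) = 0;

  // Size the result and convert without the flag, so bad input no longer
  // empties the whole string.
  Len := MultiByteToWideChar(CP_UTF8, 0,
    PAnsiChar(@Bytes[0]), Length(Bytes), nil, 0);
  SetLength(Result, Len);
  if Len > 0 then
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(@Bytes[0]), Length(Bytes),
      PWideChar(Result), Len);
end;

end.
```

A caller can then decide what to do with HadErrors after the text is safely loaded: raise an exception, log a warning, or search the result for the replacement characters.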

There are relevant QC reports on this issue, including 76571, 79042 and 111980. The first one has been resolved 'as designed'.

(Edited to work with Delphi Berlin)

frogb
  • until Delphi 10.1 you could just `class helper for Tmbcsencoding public property UnicodeFlags: cardinal read FMBToWCharFlags write FMBToWCharFlags end;` then use `initialization Tencoding.UTF8.UnicodeFlags := 0; end.` – Arioch 'The Jan 16 '17 at 15:17
  • It also would not work if one obtains `TUTF8Encoding` object(s) by other means than `TEncoding.GetUTF8`, for example in XE2 `TEncoding.GetEncoding(CP_UTF8)` would create a new instance of `TUTF8Encoding` instead of local one – Arioch 'The Jan 16 '17 at 15:47
  • The purpose of the conditional compilation was to retain the original posted solution for code earlier than Berlin, using the code helper as originally implemented. I left indeterminate what is to be done for future compilers because even the ASM solution may be closed off in a future release. – frogb Jan 17 '17 at 23:08
  • As I explain below, the purpose of the accepted code was to fix the built-in UTF8 detection. I have no interest in obtaining new encoding objects. But thanks anyway. – frogb Jan 17 '17 at 23:11
  • You cannot ensure that the libraries you use do not obtain encodings that way. Those "new objects" are "built-in detection" in exactly the same way. What's more, should any library call the standard `FreeEncodings` method for any reason, the object would be re-created – Arioch 'The Jan 18 '17 at 09:38
0

Your "global" approach is not really global: it relies on the assumption that all code uses one and the same instance of TUTF8Encoding, namely the instance whose flags field you hacked.

It would not work if someone obtains TUTF8Encoding object(s) by means other than TEncoding.GetUTF8. For example, in XE2 another method, TEncoding.GetEncoding(CP_UTF8), creates a new instance of TUTF8Encoding instead of reusing the shared FUTF8 one. Or some function might call TUTF8Encoding.Create directly.

So I'd suggest two more approaches.

The first approach patches the class implementation and is somewhat hacky. You introduce your own class for the sake of obtaining a new, "fixed" constructor body.

type
  TMyUTF8Encoding = class(TUTF8Encoding)
  public
    constructor Create; override;
  end;

This constructor would be a copy of the TUTF8Encoding.Create() implementation, except that it sets the flag as you want it. (In XE2 this is done by calling another, inherited Create(x,y,z), so you would not need access to the private field.)

Then you can patch the VMT of the stock TUTF8Encoding, replacing its virtual constructor with that new constructor of yours.

You may read the Delphi documentation about "internal formats" and so forth to get the VMT layout. You would also need to call VirtualProtect (or another platform-specific function) to remove protection from the VMT memory area before patching, and then restore it afterwards.

Examples to learn from

Or you may try using the Delphi Detours library; hopefully it can patch virtual constructors. Then again, it might be overkill to use that rather complex library for this single goal.

After you have hacked the TUTF8Encoding class, call TEncoding.FreeEncodings to remove the already created shared instances (if any) and thus trigger re-creation of the UTF8 instances with your modifications.


The second approach: if you compile your program as a single monolithic EXE, without using runtime BPL packages, you can simply copy the SysUtils.pas source to your application folder and include that local copy in your project explicitly.

How to patch a method in Classes.pas

There you would change the TUTF8Encoding implementation itself as you see fit in the sources, and Delphi would use it.

This brain-dead simple (hence equally reliable) approach would not work, though, if your projects are built to use the rtlNNN.bpl runtime package instead of being monolithic.

Community
Arioch 'The
  • Thanks for your suggestions, which I hope are useful to someone else, but unfortunately they add nothing that I need. As I said when I first raised this issue, I never require encodings such as the MyEncoding that you create. The core of my problem was the AUTOMATIC detection of the encoding of a file passed to my application which is not under my control. So I NEVER need to supply an encoding. I simply need to avoid an exception, or an empty file, when an file with invalid UTF8 is presented and read. The solution I accepted has worked well for some years for me, which is why I so marked it. – frogb Jan 17 '17 at 23:14
  • You did not patch the AUTOMATIC detection in its entirety, only one of many paths. You are building your safety on two premises: that no library will ever use any other method of obtaining the standard `TUTF8Encoding` object, and that no library will ever `Destroy` the single `TUTF8Encoding` object you patched. Both are shaky ground; they may work in 99% of cases and then give you errors in the remaining 1%. And because you have the false feeling that you "patched built-in UTF8 detection" (which you did only in part), you would easily overlook the source of those errors – Arioch 'The Jan 18 '17 at 09:45
  • `as the MyEncoding that you create` - is just a trampoline device to make Delphi build a function you then inject into standard TUTF8Encoding on a PERMANENT basis. You never use that class for itself. You miss the point - it should be `TUTF8Encoding` class that needs patching, not the instances of it. `MyEncoding` class is not the class to be used as in @Marc Durdin answer, you never instantiate it, it is merely a donor of fixed code to patch the built-in class with. – Arioch 'The Jan 18 '17 at 09:48
  • Thank you again for your comments. – frogb Jan 19 '17 at 11:15