I have a question about a specific programming problem in Delphi 10.2 (Object Pascal).

StringOfChar and FillChar don't work properly in Win64 Release builds on CPUs released before 2012.

  • The expected result of FillChar is a plain sequence of one repeated 8-bit character in a given memory buffer.

  • The expected result of StringOfChar is the same, but the result is stored in a string type.

In practice, when I compile applications that worked under earlier Delphi versions with Delphi 10.2, the Win64 builds stop working properly on CPUs released before 2012.

StringOfChar and FillChar produce a string of different characters, albeit in a repeating pattern, instead of a sequence of the same character.

Here is minimal code that demonstrates the issue. Note that the sequence must be at least 16 characters long, and the character must not be NUL (#0):

procedure TestStringOfChar;
var
  a: AnsiString;
  ac: AnsiChar;
begin
  ac := #1;
  a := StringOfChar(ac, 43);
  if a <> #1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1 then
  begin
    raise Exception.Create('ANSI StringOfChar Failed!!');
  end;
end;

I know that there are lots of Delphi programmers on StackOverflow. Are you experiencing the same problem? If so, how did you resolve it? What is the solution? By the way, I have contacted the developers of Delphi, but they have neither confirmed nor denied the issue so far. I'm using Embarcadero Delphi 10.2 Version 25.0.26309.314.

Update:

If your CPU is manufactured in 2012 or later, additionally include the following lines before calling StringOfChar to reproduce the issue:

const
  ERMSBBit    = 1 shl 9; //$0200
begin
  CPUIDTable[7].EBX := CPUIDTable[7].EBX and not ERMSBBit;
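The two snippets above can be combined into one console program. The following is only a sketch, assuming a Win64 Release build with Delphi 10.2 where `CPUIDTable` is writable as shown above; the program name, the `TestLen` constant and the character-by-character check are illustrative choices, not part of the original report.

```pascal
program ReproStringOfChar;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  ERMSBBit = 1 shl 9; // CPUID leaf 7, EBX bit 9, as in the snippet above
  TestLen  = 43;      // illustrative; the report says at least 16 is needed

var
  S: AnsiString;
  I: Integer;
begin
  // Pretend the CPU has no ERMSB so that the code path used on
  // pre-2012 CPUs is taken even on newer hardware.
  CPUIDTable[7].EBX := CPUIDTable[7].EBX and not ERMSBBit;

  S := StringOfChar(AnsiChar(#1), TestLen);

  if Length(S) <> TestLen then
  begin
    Writeln('Unexpected length: ', Length(S));
    ExitCode := 1;
    Exit;
  end;

  // Report the first byte that differs from the fill character.
  for I := 1 to Length(S) do
    if S[I] <> #1 then
    begin
      Writeln(Format('Mismatch at index %d: #%d', [I, Ord(S[I])]));
      ExitCode := 1;
      Exit;
    end;

  Writeln('StringOfChar returned the expected string');
end.
```

Checking byte by byte instead of comparing against a long literal makes it easier to see where the pattern first goes wrong.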

As for the April 2017 RAD Studio 10.2 Hotfix for Toolchain Issues: I have tried with it and without it, and it didn't help. The issue exists regardless of the Hotfix.

Update #2

Embarcadero has confirmed and resolved this issue on 08/Aug/17 6:03 PM. So, in Delphi 10.2 Tokyo Release 1 (released on August 8, 2017) this bug is fixed.

Maxim Masiutin
  • It is clearly an RTL bug, not a compiler bug. The only thing to do is either patch the RTL manually until Embarcadero fixes it properly, or else avoid using portions of the RTL affected by the bug. – Remy Lebeau May 14 '17 at 01:05
  • I'm surprised at some of the peculiarities of your example, which raise questions about the exact conditions required to reproduce your problem. You certainly haven't chosen a short, simple, easily human-checkable example (e.g. `StringOfChar('a', 3);`), which simply raises further questions about the conditions that are required to cause this problem to manifest. _Presumably you have actually tested this thoroughly?_ So: Do you really need a long string to reproduce? Do you need special chars (#1 is a control character) to reproduce? And more significantly what string ***is actually returned?*** – Disillusioned May 14 '17 at 04:17
  • @CraigYoung - Thank you for your attention to this issue. Actually returned is #1'ыЩы'#1'ыЩы'#1'ыЩы'#1'ыЩы'#1'ыЩы'#1'ыЩы'#1'ыЩы'#1'ыЩы'#1#1#1#1#1#1#1#1#1#1#1 – Maxim Masiutin May 14 '17 at 05:03
  • @CraigYoung - the string has to be longer than 16 characters. With just 3 characters, as in your example, it does not reproduce. With #0 (zero byte) it also does not reproduce. The character has to be non-zero. It also reproduces with any other character, like the 'a' you gave. Why did you write that you had further questions about the conditions? Isn't that too easy? :-) – Maxim Masiutin May 14 '17 at 05:07
  • Are you using the hotfix released recently? And you've submitted a bug report. I'm not sure what your question here is. We aren't the developers of this tool. We can't fix it. – David Heffernan May 14 '17 at 05:12
  • @CraigYoung - I have updated the question. If you have newer CPU, you may add the following code to reproduce the issue: const ERMSBBit = 1 shl 9; begin CPUIDTable[7].EBX := CPUIDTable[7].EBX and not ERMSBBit; – Maxim Masiutin May 14 '17 at 05:26
  • @DavidHeffernan - thank you for pointing that out. Yes, I tried both with this hotfix and without it. I have updated the text to address that. – Maxim Masiutin May 14 '17 at 05:26
  • But what is the question? This is not the place to report compiler bugs. – David Heffernan May 14 '17 at 05:42
  • @RemyLebeau - you wrote "avoid using portions of the RTL" - but the problem is in FillChar, which is used everywhere. – Maxim Masiutin May 14 '17 at 11:13
  • So roll back to Berlin and wait for it to be fixed. What do you expect us to do? – David Heffernan May 14 '17 at 12:44
  • @DavidHeffernan you are not that friendly :-) I expected that you could also confirm or deny this, maybe propose a workaround or patch, share your ideas, or follow the link and add comments. I have reviewed the StackOverflow rules, and it turns out that my post formally complies :-) – Maxim Masiutin May 14 '17 at 13:38
  • No, this post is off topic. You'd be better off asking at the Google+ Delphi Developers community. Fixing it is simple enough. Create a working version of the function and use a runtime code hook to replace the function. There are many examples of that. The bigger worry is that the compiler might be broken. Do you trust it? Just roll back and wait for Emba to fix it. – David Heffernan May 14 '17 at 13:42
  • @DavidHeffernan I have patched System.pas and recompiled it, so now it is OK. I didn't know about the "runtime code hook to replace the function". I didn't know this was possible in Delphi. Could you please send a link? At last, a helpful answer. Please consider making it a formal answer, and I'd be grateful! Maybe some people will also get useful information on how to replace FillChar from System.pas with the "runtime hook" you wrote about. – Maxim Masiutin May 14 '17 at 13:48
  • There are hundreds of examples of that here. http://stackoverflow.com/questions/8978177/patch-routine-call-in-delphi – David Heffernan May 14 '17 at 13:54
  • @DavidHeffernan Thank you for the link. I was aware of this method, but didn't consider it a good one. We try never to modify the code segment. By the way, it was a good idea that Intel processors had an "execute only" flag for code segments, but, unfortunately, this mode is not used by Windows. I would rather recompile System.pas. Should you be interested in the patch and the insights that I have provided, and also in why FillChar became very slow in 10.2, just visit my link. Thank you! And thank you for your help. – Maxim Masiutin May 14 '17 at 14:02
  • Your program already generates code at runtime. That's how window procedures are hooked up to methods. Known as a thunk. You can use whatever solution you like. Personally I prefer to avoid recompiling RTL units. I consider that brittle and tricky to manage. Hooking code is mainstream and widely used. – David Heffernan May 14 '17 at 14:05
  • @DavidHeffernan Thank you, I will do that then. This also allows customizing cases for different CPU types that have different capabilities, e.g. AVX, ERMS, etc. – Maxim Masiutin May 14 '17 at 14:17
  • I just ran this code on my 2009 Mac, and it worked beautifully in both Win32 and Win64. In other words: **I can't reproduce the problem**. The CPU is an i7 from well before 2012. Tokyo build 25.0.26309.314. CPU: Intel Core i7 860 @2.8GHz (http://ark.intel.com/products/41316/Intel-Core-i7-860-Processor-8M-Cache-2_80-GHz) – Rudy Velthuis May 14 '17 at 23:08
  • Just to be sure, I even did the `ERMSBit` thing, and it didn't make a difference. – Rudy Velthuis May 14 '17 at 23:16
  • I noticed you wrote a long answer about ERMSB yourself. I see that FillChar checks it and if it is set, it does a plain and simple `REP STOSB`. Why don't you, for now, simply **set** the bit? Some fills might be a little slower, but how often is that a problem? – Rudy Velthuis May 14 '17 at 23:50
  • @RudyVelthuis - thank you, there is a link in my message - "I have contacted the developers". I did modify and recompile System.pas to fix the issue, but David Heffernan has proposed a very good alternative solution: modifying the entry point of FillChar at run time. That's why StackOverflow is helpful. – Maxim Masiutin May 15 '17 at 06:10
  • @David: "Hooking code is mainstream and widely used". Perhaps. Runtime patching code is certainly not mainstream (or shouldn't be) and always a hack. I agree that it is better than recompiling RTL units. – Rudy Velthuis May 15 '17 at 07:27
  • @RudyVelthuis In the face of a defective library, one chooses the best hack available. I've made extensive use of code hooks, over many years, to fix defects that Emba have decided not to fix. This approach is the most readily maintainable. – David Heffernan May 15 '17 at 07:50
  • Can someone give me a hint where I can find some documentation about all this? [Embarcadero Technologies does not currently have any additional information](http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.CPUIDTable). And why is there no kind of OS abstraction layer (WinApi) for these low-level things? – ventiseis May 16 '17 at 20:14
  • @ventiseis - you may find additional information at https://quality.embarcadero.com/browse/RSP-18071 (a Delphi developer account is required to log in). You can add yourself as a watcher of the issue. – Maxim Masiutin May 16 '17 at 20:17
  • @ventiseis if you just want information on the CPUINFO structure that you have quoted - it is just the result of calling the CPUID instruction. You can check the source code of System.pas to see how the structure is filled from multiple calls of the CPUID instruction. – Maxim Masiutin May 16 '17 at 20:19
  • @ventiseis: Documentation about what? Bugs? And this low-level thing is abstracted. On any OS but Win32 and Win64, the code is in "pure Pascal" and independent of assembly. I even found the "pure Pascal" code not any slower than the machine code versions. Of course CPUID can not be abstracted. It only exists on x86 and x86-64 systems. It is CPU-specific. – Rudy Velthuis May 17 '17 at 15:35
  • @RudyVelthuis Thank you for the hints. I understand that CPUID is a way to get info about the CPU - but I'd expect a big warning sign on that wiki page: if you fiddle around with the `CPUIDTable` variable you can break `FillChar`. My opinion: if that variable is essential for low-level functions, it shouldn't be publicly writable *or* these dependencies should be documented. And I'd prefer a pure Pascal implementation if possible. But besides this: I'm always interested to learn something new and this is certainly an area where my knowledge can improve. – ventiseis May 17 '17 at 19:54
  • Actually, it is not meant to break `FillChar` (and I can't even confirm it does). Fiddling with the ERMSB bit (**setting** the bit) can make `FillChar` slower (quite a lot actually), but in both cases it *should* produce the correct result. So this should perhaps be mentioned in the docs for `FillChar` (64 bit). – Rudy Velthuis May 17 '17 at 19:57
  • @RudyVelthuis, The reason that Emba's assembly code is not faster than the pure Pascal versions is that the assembly code is badly written. A proper asm implementation would be a **lot** faster. – Johan May 23 '17 at 09:55
  • @Johan: in Tokyo, the assembly code for `FillChar` is not badly written at all, AFAICT. And before Tokyo, the implementation (for Win64) is in pure Pascal. BTW, the implementation from FastSystem is not faster or slower either. Could it be that there is some kind of caching going on? They all display the same speed. Only the ERMSB implementation (i.e. plain REP STOSB, on a suitable processor) turns out to be twice as fast. – Rudy Velthuis May 23 '17 at 10:27
  • @RudyVelthuis - if you fix the FillChar bug as described in https://quality.embarcadero.com/browse/RSP-18071 and also include a comparison that REP STOSB is only called for blocks of at least 512 bytes (since for smaller blocks, that implementation is faster), as described in https://quality.embarcadero.com/browse/RSP-18068 - you will get very good results indeed! – Maxim Masiutin May 24 '17 at 11:59
  • I tried on an old CPU and on a newer one, all in Tokyo and in Berlin. I also tried the FastSystem version. Neither was faster than the other. Only when I used Tokyo's new x64 code on a larger array (> 512 bytes) was it actually faster, on the newer CPU (newer than 2012), probably because of how the newer CPUs are optimized for REP STOSB. I tried many different array sizes, none of them a nice multiple of 2. Oh wait, the FastSystem version was twice as fast once it started using MOVNTI. Your mileage may vary. – Rudy Velthuis May 24 '17 at 13:34
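The "runtime code hook" discussed in the comments above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the linked question: `RedirectProcedure`, `TJumpRel32`, `BrokenAnswer` and `FixedAnswer` are hypothetical names, and the demo redirects an ordinary user routine. Obtaining the address of the RTL's own `FillChar` is a separate problem that this sketch does not attempt.

```pascal
program HookSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

type
  // Encodes "JMP rel32": opcode $E9 followed by a 32-bit relative offset.
  TJumpRel32 = packed record
    Opcode: Byte;
    Offset: Integer;
  end;

// Overwrites the first five bytes of OldProc with a jump to NewProc.
procedure RedirectProcedure(OldProc, NewProc: Pointer);
var
  Jump: TJumpRel32;
  OldProtect, Ignored: DWORD;
begin
  Jump.Opcode := $E9;
  // The offset is relative to the end of the jump instruction; on Win64
  // both routines must therefore lie within 2 GB of each other.
  Jump.Offset := Integer(NativeInt(NewProc) - NativeInt(OldProc) - SizeOf(Jump));
  if VirtualProtect(OldProc, SizeOf(Jump), PAGE_EXECUTE_READWRITE, OldProtect) then
  begin
    Move(Jump, OldProc^, SizeOf(Jump));
    FlushInstructionCache(GetCurrentProcess, OldProc, SizeOf(Jump));
    // Restore the original page protection.
    VirtualProtect(OldProc, SizeOf(Jump), OldProtect, Ignored);
  end;
end;

// Hypothetical stand-ins for a defective routine and its replacement.
function BrokenAnswer: Integer;
begin
  Result := 1;
end;

function FixedAnswer: Integer;
begin
  Result := 2;
end;

begin
  RedirectProcedure(@BrokenAnswer, @FixedAnswer);
  Writeln(BrokenAnswer); // the call now lands in FixedAnswer
end.
```

The patch should be applied once, early at startup and before other threads run, because the target routine is briefly in an inconsistent state while its first bytes are being overwritten.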

2 Answers

StringOfChar(A: AnsiChar, count) uses FillChar under the hood.
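To illustrate why a defect in FillChar surfaces in StringOfChar, here is a rough sketch of how such a function can be layered on FillChar. It is not the RTL source; `SketchStringOfChar` is a hypothetical name used only for this example.

```pascal
program StringOfCharSketch;

{$APPTYPE CONSOLE}

// Hypothetical stand-in showing the general shape: allocate the string,
// then let FillChar write the same byte into every position.
function SketchStringOfChar(Ch: AnsiChar; Count: Integer): AnsiString;
begin
  Result := '';
  if Count > 0 then
  begin
    SetLength(Result, Count);
    FillChar(Pointer(Result)^, Count, Ch);
  end;
end;

begin
  Writeln(SketchStringOfChar('a', 20));
end.
```

Any routine built this way inherits whatever FillChar does wrong for a given length and fill value.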

You can use the following code to fix the issue:

(*******************************************************
 System.FastSystem
 A fast drop-in addition to speed up function in system.pas
 It should compile and run in XE2 and beyond.
 Alpha version 0.5, fully tested in Win64
 (c) Copyright 2016 J. Bontes
   This Source Code Form is subject to the terms of the
   Mozilla Public License, v. 2.0.
   If a copy of the MPL was not distributed with this file,
   You can obtain one at http://mozilla.org/MPL/2.0/.
********************************************************
FillChar code is an altered version FillCharsse2 SynCommons.pas
which is part of Synopse framework by Arnaud Bouchez
********************************************************
Changelog
0.5 Initial version:
********************************************************)

unit FastSystem;

interface

procedure FillChar(var Dest; Count: NativeInt; Value: ansichar); inline; overload;
procedure FillChar(var Dest; Count: NativeInt; Value: Byte); overload;
procedure FillMemory(Destination: Pointer; Length: NativeUInt; Fill: Byte); inline;
{$EXTERNALSYM FillMemory}
procedure ZeroMemory(Destination: Pointer; Length: NativeUInt); inline;
{$EXTERNALSYM ZeroMemory}

implementation

procedure FillChar(var Dest; Count: NativeInt; Value: ansichar); inline; overload;
begin
  FillChar(Dest, Count, byte(Value));
end;

procedure FillMemory(Destination: Pointer; Length: NativeUInt; Fill: Byte);
begin
  FillChar(Destination^, Length, Fill);
end;

procedure ZeroMemory(Destination: Pointer; Length: NativeUInt); inline;
begin
  FillChar(Destination^, Length, 0);
end;

//This code is 3x faster than System.FillChar on x64.

{$ifdef CPUX64}
procedure FillChar(var Dest; Count: NativeInt; Value: Byte);
//rcx = dest
//rdx=count
//r8b=value
asm
              .noframe
              .align 16
              movzx r8,r8b           //There's no need to optimize for count <= 3
              mov rax,$0101010101010101
              mov r9d,edx
              imul rax,r8            //fill rax with value.
              cmp rdx,59             //Use simple code for small blocks.
              jl  @Below32
@Above32:     mov r11,rcx
              mov r8b,7              //code shrink to help alignment.
              lea r9,[rcx+rdx]       //r9=end of array
              sub rdx,8
              rep mov [rcx],rax
              add rcx,8
              and r11,r8             //and 7 See if dest is aligned
              jz @tail
@NotAligned:  xor rcx,r11            //align dest
              lea rdx,[rdx+r11]
@tail:        test r9,r8             //and 7 is tail aligned?
              jz @alignOK
@tailwrite:   mov [r9-8],rax         //no, we need to do a tail write
              and r9,r8              //and 7
              sub rdx,r9             //dec(count, tailcount)
@alignOK:     mov r10,rdx
              and edx,(32+16+8)      //count the partial iterations of the loop
              mov r8b,64             //code shrink to help alignment.
              mov r9,rdx
              jz @Initloop64
@partialloop: shr r9,1              //every instruction is 4 bytes
              lea r11,[rip + @partial +(4*7)] //start at the end of the loop
              sub r11,r9            //step back as needed
              add rcx,rdx            //add the partial loop count to dest
              cmp r10,r8             //do we need to do more loops?
              jmp r11                //do a partial loop
@Initloop64:  shr r10,6              //any work left?
              jz @done               //no, return
              mov rdx,r10
              shr r10,(19-6)         //use non-temporal move for > 512kb
              jnz @InitFillHuge
@Doloop64:    add rcx,r8
              dec edx
              mov [rcx-64+00H],rax
              mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jnz @DoLoop64
@done:        rep ret
              //db $66,$66,$0f,$1f,$44,$00,$00 //nop7
@partial:     mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jge @Initloop64        //are we done with all loops?
              rep ret
              db $0F,$1F,$40,$00
@InitFillHuge:
@FillHuge:    add rcx,r8
              dec rdx
              db $48,$0F,$C3,$41,$C0 // movnti  [rcx-64+00H],rax
              db $48,$0F,$C3,$41,$C8 // movnti  [rcx-64+08H],rax
              db $48,$0F,$C3,$41,$D0 // movnti  [rcx-64+10H],rax
              db $48,$0F,$C3,$41,$D8 // movnti  [rcx-64+18H],rax
              db $48,$0F,$C3,$41,$E0 // movnti  [rcx-64+20H],rax
              db $48,$0F,$C3,$41,$E8 // movnti  [rcx-64+28H],rax
              db $48,$0F,$C3,$41,$F0 // movnti  [rcx-64+30H],rax
              db $48,$0F,$C3,$41,$F8 // movnti  [rcx-64+38H],rax
              jnz @FillHuge
@donefillhuge:mfence
              rep ret
              db $0F,$1F,$44,$00,$00  //db $0F,$1F,$40,$00
@Below32:     and  r9d,not(3)
              jz @SizeIs3
@FillTail:    sub   edx,4
              lea   r10,[rip + @SmallFill + (15*4)]
              sub   r10,r9
              jmp   r10
@SmallFill:   rep mov [rcx+56], eax
              rep mov [rcx+52], eax
              rep mov [rcx+48], eax
              rep mov [rcx+44], eax
              rep mov [rcx+40], eax
              rep mov [rcx+36], eax
              rep mov [rcx+32], eax
              rep mov [rcx+28], eax
              rep mov [rcx+24], eax
              rep mov [rcx+20], eax
              rep mov [rcx+16], eax
              rep mov [rcx+12], eax
              rep mov [rcx+08], eax
              rep mov [rcx+04], eax
              mov [rcx],eax
@Fallthough:  mov [rcx+rdx],eax  //unaligned write to fix up tail
              rep ret

@SizeIs3:     shl edx,2           //r9 <= 3  r9*4
              lea r10,[rip + @do3 + (4*3)]
              sub r10,rdx
              jmp r10
@do3:         rep mov [rcx+2],al
@do2:         mov [rcx],ax
              ret
@do1:         mov [rcx],al
              rep ret
@do0:         rep ret
end;
{$endif}

end.

The easiest way to fix your issue is to download mORMot and include SynCommons.pas in your project. This will patch System.FillChar with the above code and include a couple of other performance improvements as well.

Note that you don't need all of mORMot, just SynCommons by itself.

Johan
  • This may be 3x faster in Berlin and below, but Tokyo got an optimized update. For me it works, but obviously for Maxim it doesn't. I could time both to see which is faster for me. Hang on... – Rudy Velthuis May 16 '17 at 19:34
  • OK, I checked. This is faster than the new System.FillChar too. About 2-3 times as fast. – Rudy Velthuis May 16 '17 at 20:02
  • Thanks again, Johan, for having contributed your x64 asm code to SynCommons.pas! – Arnaud Bouchez May 16 '17 at 20:16
  • On my system, using a 16 K buffer and a Delphi 10.2.3 x64 Release build, System.FillChar is 2x faster than this custom version. Exactly 2 times faster. The difference is a little smaller when the buffer size increases, but it is still faster. Besides that, I can't recreate the bug at all. – Alexandre M May 03 '18 at 22:06

I took the test case from the FastCode Challenge - http://fastcode.sourceforge.net/

I have compiled the FillChar testing tool under Win64, and removed all 32-bit versions of FillChar present in the test.

I have left 2 versions of 64-bit FillChar:

  1. FC_TokyoBugfixAVXEx - the one present in Delphi Tokyo 64-bit, with bugs fixed and AVX registers added. There is branching to detect ERMSB, AVX1 and AVX2 CPU capabilities. This branching happens on each FillChar call. There is no entry point patching or function address mapping.
  2. FillChar_J_Bontes - another version of FillChar, the function from System.FastSystem that you have posted here.

I didn't test vanilla FillChar from Delphi Tokyo, because it contains a bug described in my initial post, and it improperly handles ERMSB.
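For readers who want a quick number without the full FastCode harness, a much simpler timing loop can be sketched as below. This is not the FastCode test tool and will not reproduce its weighted scores; `BufSize` and `Iterations` are arbitrary illustrative values, and the loop times whichever FillChar is in scope (System.FillChar here).

```pascal
program TimeFillChar;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

const
  BufSize    = 16 * 1024; // illustrative buffer size in bytes
  Iterations = 100000;    // illustrative repeat count

var
  Buf: TBytes;
  SW: TStopwatch;
  I: Integer;
begin
  SetLength(Buf, BufSize);

  SW := TStopwatch.StartNew;
  for I := 1 to Iterations do
    // Vary the fill value so every pass writes different data.
    FillChar(Buf[0], BufSize, Byte(I and $FF));
  SW.Stop;

  Writeln(Format('%d fills of %d bytes: %d ms',
    [Iterations, BufSize, SW.ElapsedMilliseconds]));
end.
```

Results from a loop like this depend heavily on buffer size, alignment and cache state, which is why the tests in this answer use several block sizes and alignments.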

Kaby Lake - i7-7700K

[Image: FillChar results, Kaby Lake i7-7700K]

The first column is the alignment of the function. The next four columns are the results of the four tests; lower is better. The first test operates on a smaller block, the second on a larger one, and so on. The last column is a weighted summary of all tests.

The CPU in the first test is Kaby Lake i7-7700K (January 2017). Frequency 4.2 GHz (turbo frequency up to 4.5 GHz), L2 cache 4 × 256 KB, L3 cache 8 MB.

Ivy Bridge - E5-2603 v2

Here are the results of a second test, on a previous microarchitecture: Xeon E5-2603 v2 "Ivy Bridge" (September 2013), frequency 1.8 GHz, L2 Cache 4 × 256 KB, L3 Cache 10 MB, RAM 4 × DDR3-1333.

[Image: FillChar results, Xeon E5-2603 v2]

Ivy Bridge - E5-2643 v2

Here are the test results on a third set of hardware: Intel Xeon E5-2643 v2 (September 2013), frequency 3.5 GHz, L2 Cache 6 × 256 KB, L3 Cache 25 MB, RAM 4 × DDR3-1600.

[Image: FillChar results, Xeon E5-2643 v2]

Intel Core i9 7900X

Here are the test results on a fourth set of hardware: Intel Core i9 7900X (June 2017), frequency 3.3 GHz (turbo frequency up to 4.5 GHz), L2 Cache 10 × 1024 KB, L3 Cache 13.75 MB, RAM 4 × DDR4-2134.

[Image: FillChar results, Intel Core i9 7900X]

Maxim Masiutin