5

Is there any code for a Pos() version that's as fast in 64-bit than the current 32-bit?

To my understanding, the 32-bit version in Delphi (tested up to XE5) adopted the FastCode assembler versions many years ago, but for 64-bit it uses a PurePascal version, which is around 5 to 10 times slower.

Some tests, same procedure in a long loop:

32-bit: 65..90ms

64-bit: 280..300ms

Arioch 'The
  • 15,799
  • 35
  • 62
hikari
  • 3,393
  • 1
  • 33
  • 72
  • 1
    Please can you provide your test program – David Heffernan Jan 05 '14 at 20:58
  • Do you search for substrings or single chars ? – Arioch 'The Jan 05 '14 at 21:00
  • I search for substrings – hikari Jan 05 '14 at 23:09
  • I made a QC report about the slow `Pos` in purepascal. See [`Inefficient loop in Pos() for purepascal`](http://qc.embarcadero.com/wc/qcmain.aspx?d=111103) for a faster implementation. Conclusion was to use FastCoders purepascal version, `PosSHA`or `PosSHA2`. You can find the code on Fastcoders page, http://fastcode.sourceforge.net/ – LU RD Jan 05 '14 at 23:34
  • Hi, I have the FastCode files, but can'T locate the PosSHA/PosSHA2 function, could you please point me to them? thank you – hikari Jan 05 '14 at 23:47

2 Answers2

11

Using Fastcoders purepascal PosEx_Sha_Pas_2 algorithm (modified to fit x64):

function PosEx_Sha_Pas_2(const SubStr, S: string; Offset: Integer = 1): Integer;
Type
  PInteger =^Integer;
var
  len, lenSub: Integer;
  ch: char;
  p, pSub, pStart, pStop: pchar;
label
  Loop0, Loop4,
  TestT, Test0, Test1, Test2, Test3, Test4,
  AfterTestT, AfterTest0,
  Ret, Exit;
begin;
  pSub := pointer(SubStr);
  p := pointer(S);

  if (p = nil) or (pSub = nil) or (Offset < 1) then
  begin;
    Result := 0;
    goto Exit;
  end;

  lenSub := PLongInt(PByte(pSub) - 4)^ - 1; // <- Modified
  len := PLongInt(PByte(p) - 4)^; // <- Modified
  if (len < lenSub + Offset) or (lenSub < 0) then
  begin;
    Result := 0;
    goto Exit;
  end;

  pStop := p + len;
  p := p + lenSub;
  pSub := pSub + lenSub;
  pStart := p;
  p := p + Offset + 3;

  ch := pSub[0];
  lenSub := -lenSub;
  if p < pStop then
    goto Loop4;
  p := p - 4;
  goto Loop0;

Loop4:
  if ch = p[-4] then
    goto Test4;
  if ch = p[-3] then
    goto Test3;
  if ch = p[-2] then
    goto Test2;
  if ch = p[-1] then
    goto Test1;
Loop0:
  if ch = p[0] then
    goto Test0;
AfterTest0:
  if ch = p[1] then
    goto TestT;
AfterTestT:
  p := p + 6;
  if p < pStop then
    goto Loop4;
  p := p - 4;
  if p < pStop then
    goto Loop0;
  Result := 0;
  goto Exit;

Test3:
  p := p - 2;
Test1:
  p := p - 2;
TestT:
  len := lenSub;
  if lenSub <> 0 then
    repeat
      ;
      if (pSub[len] <> p[len + 1]) or (pSub[len + 1] <> p[len + 2]) then
        goto AfterTestT;
      len := len + 2;
    until len >= 0;
  p := p + 2;
  if p <= pStop then
    goto Ret;
  Result := 0;
  goto Exit;

Test4:
  p := p - 2;
Test2:
  p := p - 2;
Test0:
  len := lenSub;
  if lenSub <> 0 then
    repeat
      ;
      if (pSub[len] <> p[len]) or (pSub[len + 1] <> p[len + 1]) then
        goto AfterTest0;
      len := len + 2;
    until len >= 0;
  Inc(p);
Ret:
  Result := p - pStart;
Exit:
end;

And the result using David's test case (x64 release):

System.Pos       18427
wcsstr            8122
PosEx_Sha_Pas_2   2282

For the x32 release the results are:

System.Pos        2171
wcsstr            9634
PosEx_Sha_Pas_2   1868

Conclusion:

PosEx_Sha_Pas_2 is almost as fast in x64 bit mode as Pos in x32 bit mode. Additionally it seems as PosEx_Sha_Pas_2 is faster than Pos in x32 bit mode.

All tests here is with the XE4 version.


Improve purepascal System.Pos() still open in Tokyo 10.2.


Update:

As of Delphi 11.0 Alexandria, the purepascal version of Pos() is using the Fastcoders PosEx_Sha_Pas_2.pas version. Thanks @StefanGlienke for your effort in making this happen!

LU RD
  • 34,438
  • 5
  • 88
  • 296
  • 2
    Very nice. Applied this to a custom xml parser, speed became 10x faster in x64 with this Pos. Thanks! – hikari Jan 06 '14 at 14:45
  • I suspect we need a more comprehensive test before we can say that the SHA version is better than x86 asm Pos. Perhaps substring length is important. – David Heffernan Jan 06 '14 at 23:21
  • 1
    @DavidHeffernan, I made some more tests for 32 bit mode, and it seems as `System.Pos` is a bit faster if the substring is found within the first 20-25 characters. After that the SHA version is more effective. The size of the substring does not really have any effect. – LU RD Jan 07 '14 at 19:49
  • I just ran an older but similar test. At home, in my Win7 VM, there was no big difference between System.Pos and PosEx_Sha_Pas_2, and my own was somewhere inbetween. In Win64, PosEx_Sha_pas_2 was faster than the rest. Here, in a real Win7, the Fastcode Pzurepascal routine was fastest overall, in Win32 and Win64. – Rudy Velthuis Jan 09 '14 at 16:54
  • This is still very slow compared to the pure asm 32-bit version (Tokyo 10.2/W10) – hikari Jun 09 '17 at 19:02
  • @hikari, "very slow" is not what others and myself have found in tests. – LU RD Jun 09 '17 at 19:36
  • I have an app doing thousands of string operations per second, compiling to x64 using this function is roughly around half as fast. While on smaller tests it performed better it wasn't the case with my app. I'll try to prepare a demo – hikari Jun 09 '17 at 19:46
  • nvm that, it's around 15-20% slower than the 32-bit counterpart (around the same as System.Pos, looks like it was massively improved since the original question), there may be other parts of my app which run much slower in 64-bit other than the Pos calls. I would consider 15-20% still "very slow/er" though as compared to the 32-bit code. – hikari Jun 09 '17 at 21:11
  • Ok, good to know. Anyway it would have been a slam dunk for Emba to incorporate this code into the RTL for a purepascal replacement. – LU RD Jun 09 '17 at 21:15
  • Actually: removing the inlining (inline;) improved the call a lot, coming at around 6-7% slower than 32-bit with the PosEx_Sha_Pas_2 – hikari Jun 09 '17 at 21:18
  • It seems using AnsiString also has a major (detrimental) performance impact in general string manipulation routines in x64 vs x86 – hikari Jun 10 '17 at 00:17
  • 1
    @hikari, to boost the performance in 64 bit for AnsiStrings, use the function above. Only change the `String` to `AnsiString` and `Char` to `AnsiChar` and `PChar` to `PAnsiChar`. About 50% faster for AnsiStrings compared to the Strings purepascal PosEx_Sha_Pas_2 routine :-) – LU RD Jun 12 '17 at 20:57
7

The C runtime function wcsstr as implemented in the system provided msvcrt.dll is better than Delphi RTL Pos for 64 bit code.

{$APPTYPE CONSOLE}

uses
  SysUtils, Diagnostics;

function wcsstr(const str, strsearch: PWideChar): PWideChar; cdecl; external 'msvcrt.dll';

var
  i, j: Integer;
  Stopwatch: TStopwatch;
  str: string;
  P: PWideChar;

const
  N = 50000000;

begin
  str := 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do '
    + 'eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim '
    + 'ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut '
    + 'aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit '
    + 'in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur '
    + 'sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt '
    + 'mollit anim id est laborum.';

  Stopwatch := TStopwatch.StartNew;
  for i := 1 to N do
  begin
    j := Pos('tempor', str);
    if j=0 then
      Beep;
  end;
  Writeln('Pos: ' + IntToStr(Stopwatch.ElapsedMilliseconds));

  Stopwatch := TStopwatch.StartNew;
  for i := 1 to N do
  begin
    P := wcsstr(PChar(str), 'tempor');
    if P=nil then
      Beep;
  end;
  Writeln('wcsstr: ' + IntToStr(Stopwatch.ElapsedMilliseconds));

  Readln;
end.

32 bit release build

Pos: 1930
wcsstr: 6951

64 bit release build

Pos: 18384
wcsstr: 6701

Interestingly, the 32 bit Pos is clearly very well implemented.

My system is running Win7 x64 on i5-2300.

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490