19

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.

I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.

Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.

(Edit: I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )

(Edit2: Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)

Warren P
  • 65,725
  • 40
  • 181
  • 316

2 Answers2

12

This answer is now complete and works for case sensitive mode, but does not work for case insensitive mode, and probably has other bugs too, since it's not well unit tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages).

Based on Dorin Dominica's code, I built the following.

{ _FindStringBoyer:
  Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := 0;
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found, return.
        //Result := Index - kIndex + 1;
        Index := Index + LengthPattern;
        fromPos := Index;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end; 

This is really a nice Algorithm, because:

  • it's way faster to count instances of substring X in string Y this way, magnificently so.
  • For merely replacing Pos() the _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by FastCode project people, that is currently used for Pos, and if you need the case-insensitivity, you can imagine the performance boost when we don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be THAT big. But still, Efficient Algorithms are a Thing of Beauty.)

Okay I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := '';
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := rfIgnoreCase in Flags;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern < Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;

Okay, problem: Stack footprint of this:

var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

Warren P
  • 65,725
  • 40
  • 181
  • 316
  • In my testing I found that the relative speed of fastcode Pos vs. BM versions depends on quite a few parameters (lengths, alphabet sizes, repetitiveness, substring location) so that it's not easy to generalise; some asm/SIMD-based implementations that are typically significantly faster than either of the above are possible, but I don't have a unicode-ready version at hand. And yes, a 100 Mb string isn't so unlikely when you're dealing with genomes. But then, at this stage people start to use different algorithms altogether. – PhiS Jul 22 '10 at 17:42
  • Yep. But it would be nice if people could use efficient algorithms that are already well tested. Something like the above, when complete and tested, would be good to put into the JclStrings unit of the JCL library. Can you specify WHEN the fastcode Pos will be faster than a BoyerMoore algorithm? My tests are focusing on strings > 64K in length ONLY. – Warren P Jul 22 '10 at 18:02
  • Hand-optimized assembler versions of a boyer-moore type string replace that I have here are doing AnsiString search and replace about 30% faster on very large AnsiStrings. That it's THAT close is actually great, considering how much more "memory bandwidth" a giant UnicodeString needs. – Warren P Jul 22 '10 at 19:03
  • Alas no, my tests really don't constitute a sufficient sample of the parameter space for me to hazard a general prediction. What I can tell you is that for searches of random DNA substrings (~10-1000 char range) in a large random DNA string (Mb-Gb range) with randomised substring locations, FastCode Pos() tends to beat Boyer-Moore in the majority of cases. I got similar patterns with larger alphabet sizes also (e.g. 20-char alphabets or English language text). – PhiS Jul 22 '10 at 19:06
  • Oh good. So Boyer-Moore could be a "pessimization" for general everyday use. Which fits with my general mental model of the FastCode project people as "bright sparks" who wouldn't miss a trick like this, if it was always a win-win algorithm. Anybody who wants the FastStrings code tweaked-to-run-in-D2010, and the above code in a zip format, just email me. – Warren P Jul 22 '10 at 19:08
  • 2
    I have several working SSE##-based SIMD asm brute-force search versions that beat both BM and Pos() easily for my purposes; on new CPUs, SSE4.2-based routines utilising PCMPxSTRx (which can handle 2-byte characters) are a distinct option. Maybe something for another fastcode challenge? – PhiS Jul 22 '10 at 19:12
  • For large scale bioinformatics work where lots of short strings need to be repeatedly mapped to a larger reference string, people use e.g. suffix-array-based algorithms, often with Burrows-Wheeler transforms, which is great for both speed and memory use. – PhiS Jul 22 '10 at 19:15
  • Wow Dude! Care to provide a sample SIMD implementation? :-) – Warren P Jul 22 '10 at 19:41
  • 2
    If you started checking out Delphi's StringReplace versus the above, you'd see there is probably no question that Delphi's StringReplace needs to be nuked and replaced with something smarter. – Warren P Jul 22 '10 at 19:45
  • @Warren P, I will have to go through the code to get it up to a level for it to be fit for sharing, and will write more thorough validations. But once that's done, I'll be happy to share it. I only have AnsiString versions, though. – PhiS Jul 23 '10 at 06:52
  • 1
    I know this is pretty old, but thank you for writing the only Boyer-Moore implementation that I can find in Delphi that actually works, +1 – Seth Carnegie Dec 23 '11 at 05:04
  • Beware the high stack memory usage that can occur if you run this against some kinds of text.... (Imagine a chinese to english dictionary with 20,000 glyphs). – Warren P Dec 23 '11 at 14:11
  • Re the stack usage: Can you allocate it on the heap, once, the first time one of these methods is used, and free it when the program quits? Or is part of the speed dependent on it being located in the stack specifically? – David Sep 14 '14 at 18:19
  • This is really too much for the stack (default stack size is 1MB so a quarter of that is way too much). I would suggest to add a parameter where one may provide the memory or if nil the function does a getmem, freemem (so without memory initialization). – mrabat Sep 15 '14 at 07:04
  • On modern windows systems, I really doubt you could even measure the effect of moving from stack to heap. If anything, it would be microscopically slower. Most machines have 8+ gigabytes of ram. – Warren P Sep 15 '14 at 15:05
  • Or pherhaps that's an instance where a threadvar would make sense. – mrabat Sep 17 '14 at 09:41
  • **The code in this answer has several errors**. I had updated the answer with a first bug fix (28 aug), but I am currently tracing other bugs. I have removed my update again and will post a complety new answer if I have tested it enough. – Jan Doggen Sep 02 '15 at 12:09
  • **I have not succeeded in fixing the bug** in the time I am willing to spend. The function fails with case insensitive search. I will add a warning to the end of the answer. – Jan Doggen Sep 02 '15 at 14:17
  • The original question had no intention to support the case insensitive approach, and you retroactively adding it in is a mis-use of Stack overflow. I suggest you ask a new question if you want a CASE INSENSITIVE Boyer Moore. As far as I know, there is none. You just uppercase your inputs and uppercase your store. Don't amend a 4 year old question unless you wish to repair the code inline. Please try your edits again. – Warren P Sep 02 '15 at 20:01
  • I integrated your main fix into the main body of the answer Jan. Do not edit this code to add case insensitivity. – Warren P Sep 02 '15 at 20:14
  • I tested the above code in Delphi XE4 with `ShowMessage(_StringCountBoyer('Hello...Hello..Hello..Hello', 'Hello').ToString());` and result is 2. Is the code working? – RaelB May 28 '17 at 23:20
  • The case insensitive code above is indeed broken. Searches for strings like "Mike" "mike" "MIKE" yield different results as the skip table calcs do not factor in case insensitivity. I did find a solution here http://www.delphigeist.com/2010/04/boyer-moore-horspool-in-delphi-2010.html that appears to work in my limited testing. – Mike Apr 01 '18 at 18:49
2

Since I was just looking for the same: Jedi JCL has got a unicode aware search engine using Boyer-Moore in jclUnicode.pas. I have no idea how good or how fast it is yet.

dummzeuch
  • 10,975
  • 4
  • 51
  • 158