Boyer-Moore Practical in C#?

Question

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website.

I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation. Sometimes, by a considerable amount.

I wonder if anyone can help me figure out why. I understand why StringComparision.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether. Are some algorithms just not practical in C#.NET?

Below is the key code.

// Base for search classes
abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    public SearchBase(string pattern) { _pattern = pattern; }
    public abstract int Search(string text, int startIndex);
    public int Search(string text) { return Search(text, 0); }
}

/// <summary>
/// A simplified Boyer-Moore implementation.
/// 
/// Note: Uses a single skip array, which uses more memory than needed and
/// may not be large enough. Will be replaced with multi-stage table.
/// </summary>
class BoyerMoore2 : SearchBase
{
    private byte[] _skipArray;

    public BoyerMoore2(string pattern)
        : base(pattern)
    {
        // TODO: To be replaced with multi-stage table
        _skipArray = new byte[0x10000];

        for (int i = 0; i < _skipArray.Length; i++)
            _skipArray[i] = (byte)_pattern.Length;
        for (int i = 0; i < _pattern.Length - 1; i++)
            _skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);
    }

    public override int Search(string text, int startIndex)
    {
        int i = startIndex;

        // Loop while there's still room for search term
        while (i <= (text.Length - _pattern.Length))
        {
            // Look if we have a match at this position
            int j = _pattern.Length - 1;
            while (j >= 0 && _pattern[j] == text[i + j])
                j--;

            if (j < 0)
            {
                // Match found
                return i;
            }

            // Advance to next comparision
            i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);
        }
        // No match found
        return InvalidIndex;
    }
}

EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

Are you completely excluding the possibility that Boyer-Moore is employed in .net internally? Just curious. — Nikita Rybak, Feb 05 '11 at 02:26
See http://stackoverflow.com/questions/2584169/what-algorithm-net-use-for-searching-a-pattern-in-a-string and the comments under the accepted answer in particular. — Ian Mercer, Feb 05 '11 at 02:26
@Nikita: No, and I spent that last 30 minutes trying to get .NET Reflector up after Red Gate has all but ruined it. However, it looks like it goes into a routine that .NET Reflector won't show. (Unmanaged code, perhaps?) But even if it did use BM, how could it be so much faster? I'm not even including the skip-array setup in my timing. Does unmanaged code offer that much performance benefit? — Jonathan Wood, Feb 05 '11 at 02:42
@Hightechrider: Thanks for the reference, however it wasn't that much help. They basically speculate about `IndexOf()` using Boyer-Moore and talk about issues with Unicode (which I've already developed code to correctly handle). — Jonathan Wood, Feb 05 '11 at 02:43
FYI, .NET is Microsoft's implementation of C#, the CLR, and the various libraries. Mono is the other major implementation. — Gabe, Feb 05 '11 at 04:52
@Jonathan - It also explains that IndexOf uses native/internal code (which you can also see using Reflector). Since the framework resorts to native code you can bet that they have a faster implementation than you do in your managed code. — Ian Mercer, Feb 05 '11 at 07:01
@Hightechrider: Yes, this seems to be the likely explanation. I spent a little time with Reflector. Although I don't understand it that well, I followed it down until it seemed to go to unmanaged code. It possibly uses hand-optimized assembly language, and `StringComparison.Ordinal` allows it to perform a direct comparision. — Jonathan Wood, Feb 05 '11 at 08:37
note that performance really matters you can use unsafe code and work using a `char*`. Some internal string routines do that. — CodesInChaos, Feb 06 '11 at 21:56

score 22 · Accepted Answer · edited Dec 08 '16 at 21:38

22

Based on my own tests and the comments made here, I've concluded that the reason String.IndexOf() performs so well with StringComparision.Ordinal is because the method calls into unmanaged code that likely employs hand-optimized assembly language.

I have run a number of different tests and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.

If anyone's interested, I've written everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

edited Dec 08 '16 at 21:38

Chris Larabell

168
6

answered Feb 06 '11 at 21:43

Jonathan Wood

65,341
71
269
466

For others who may find it is useful, String.Contains() calls String.IndexOf(value, StringComparison.Ordinal) >= 0 . So the conclution is to use String.Conatinst() for string search – ValidfroM Sep 25 '13 at 20:58
This conclusion is data dependent. Boyer Moore can be arbitrarily faster on suitable data. – usr Dec 28 '15 at 21:06
@usr: Well, if you have evidence that it is faster in some case, please feel free to present that evidence. I have a thorough understanding of Boyer-Moore, and fully understand how it is faster for certain data, but I tested a variety of data and `String.IndexOf()` seemed faster pretty much consistently. – Jonathan Wood Dec 28 '15 at 21:11
@ValidfroM: `string.Contains()` does not tell you where the string is located. So it wouldn't be a good choice at all for string search. – Jonathan Wood Dec 10 '17 at 17:44

score 4 · Answer 2 · answered Feb 05 '11 at 02:27

4

My bet is that setting that flag allows String.IndexOf to use Boyer-Moore itself. And its implementation is better than yours.

Without that flag it has to be careful using Boyer-Moore (and probably doesn't) because of potential issues around Unicode. In particular the possibility of Unicode causes the transition tables that Boyer-Moore uses to blow up.

answered Feb 05 '11 at 02:27

btilly

43,296
3
59
88

Well, that would be a neat trick given that I'm using some pretty standard implementations. (They aren't mine, BTW.) However, if it runs in unmanaged code, that may offer some performance benefits. – Jonathan Wood Feb 05 '11 at 02:38
2

If the BCL is using a Boyer-Moore search, it should be detectable; searching for a long string ("abcdefghijklmnopqrstuvwxyz") should be measurably faster than searching for a short string ("a"). – Gabe Feb 05 '11 at 03:44
@Gabe: Good point, but it doesn't seem to be. It just seems fast no matter what. My Boyer-Moore routines, on the other hand, are definitely affected by the length of the search pattern. – Jonathan Wood Feb 05 '11 at 03:55
Hmmm, could it be .NET uses different algorithms for different cases? – Martheen Feb 05 '11 at 05:00
It's quite likely that the .Net implementation is mostly just a single instruction (`REP CMPSB`), making it go as fast as data can flow through the CPU. – Gabe Feb 05 '11 at 05:19
@Gabe: This is the type of stuff I've been wondering about. It seems likely that `IndexOf()` uses some optimizations that are not available to me as a C# programmer. It makes sense that `CMPSW` can only be used with `StringComparison.Ordinal`, which would explain why it's considerably slower in other comparison modes. – Jonathan Wood Feb 05 '11 at 08:32
3

There's no reason for you to have to **bet** here. Has **anyone** opened it up in Reflector and **checked**? – Billy ONeal Mar 09 '11 at 02:10
Actually, @billy-oneal, there is a good reason for me to have to bet. I don't use Windows or C# so testing directly is impractical for me. But I know about Boyer-Moore and its potential pitfalls from other languages in other environments. – btilly Mar 09 '11 at 03:09
@Gabe the `rep` instructions are by no means the fastest instructions. – user541686 Apr 18 '13 at 17:40
@Mehrdad: Feel free to substitute in whatever .Net actually uses. – Gabe Apr 20 '13 at 04:12

codeDom · Answer 3 · 2020-12-20T12:59:20.987

I made some small changes to your code, and made a different implementation to the Boyer-Moore algorithm and got better results. I got the idea for this implementation from here

But to be honest, I would expect to reach a higher speed compared to IndexOf.

class SearchResults
{
    public int Matches { get; set; }
    public long Ticks { get; set; }
}

abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    protected string _text;
    public SearchBase(string text, string pattern) { _text = text; _pattern = pattern; }
    public abstract int Search(int startIndex);
}

internal class BoyerMoore3 : SearchBase
{
    readonly byte[] textBytes;
    readonly byte[] patternBytes;
    readonly int valueLength;
    readonly int patternLength;
    private readonly int[] badCharacters = new int[256];
    private readonly int lastPatternByte;

    public BoyerMoore3(string text, string pattern) : base(text, pattern)
    {
        textBytes = Encoding.UTF8.GetBytes(text);
        patternBytes = Encoding.UTF8.GetBytes(pattern);
        valueLength = textBytes.Length;
        patternLength = patternBytes.Length;

        for (int i = 0; i < 256; ++i)
            badCharacters[i] = patternLength;

        lastPatternByte = patternLength - 1;

        for (int i = 0; i < lastPatternByte; ++i)
            badCharacters[patternBytes[i]] = lastPatternByte - i;
    }

    public override int Search(int startIndex)
    {
        int index = startIndex;

        while (index <= (valueLength - patternLength))
        {
            for (int i = lastPatternByte; textBytes[index + i] == patternBytes[i]; --i)
            {
                if (i == 0)
                    return index;
            }

            index += badCharacters[textBytes[index + lastPatternByte]];
        }

        // Text not found
        return InvalidIndex;
    }
}

Changed code from Form1:

    private void RunSearch(string pattern, SearchBase search, SearchResults results)
    {
        var timer = new Stopwatch();

        // Start timer
        timer.Start();

        // Find all matches
        int pos = search.Search(0);
        while (pos != -1)
        {
            results.Matches++;
            pos = search.Search(pos + pattern.Length);
        }

        // Stop timer
        timer.Stop();

        // Add to total Ticks
        results.Ticks += timer.ElapsedTicks;
    }

It looks like you only have 256 entries in your skip table. That's enough for ASCII, but not Unicode. — Jonathan Wood, Dec 20 '20 at 12:42
@JonathanWood Did you notice that the string is converted to a byte array by UTF8? — codeDom, Dec 20 '20 at 12:56
@JonathanWood and see [here](https://stackoverflow.com/a/6067955/7206675) — codeDom, Dec 21 '20 at 20:01

Boyer-Moore Practical in C#?

3 Answers3

Linked