
Is there a fast and efficient text search for a Unicode text/string? I need to search for part of a word too, not only whole words.

SearchBuf?

Thanks!

maxfax
  • Which version of Delphi? Have you tried using the `System.Pos` and `StrUtils.PosEx` functions? – RRUZ Jul 20 '11 at 00:31
  • See here: http://stackoverflow.com/questions/3310865/is-there-a-boyer-moore-string-search-and-fast-search-and-replace-function-and-fas. – Rudy Velthuis Jul 20 '11 at 00:43
  • I have Delphi 2010. What is faster: SearchBuf or Pos/PosEx? – maxfax Jul 20 '11 at 01:57
  • Use a profiler and find out! :-) Remember that it's possible that what is faster for you (with your string values) may be slower for someone else. Perhaps one algorithm is faster on strings under 87 characters in length, and another one is logarithmically faster when strings exceed 88 characters, and gets even faster (relatively) than the other when the strings exceed 182 characters. Again, just pulling these out of a hat. Maybe percent characters will slow one algorithm down but not another. You have to test, to know. – Warren P Jul 20 '11 at 02:19

3 Answers


As has already been pointed out, the fastest way of doing this depends on a number of things, most importantly whether you need to search repeatedly or not. The second question is how important it is to you to really have the "fastest" approach rather than a reasonably fast one, and how much time you are willing to invest in optimisations.

Repeated searches

If you need to search repeatedly, the most efficient way of string searching I know of is to use suffix arrays (often combined with Burrows-Wheeler transforms). This approach is used extensively in bioinformatics, where one often has to deal with a huge number of string searches over really large data sets (e.g. here). A very good suffix array (and BWT) library is the libdivsufsort C library, but unfortunately I know of no Delphi port of it. (I believe this library is capable of handling Unicode strings.)
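For illustration, here is a minimal suffix-array sketch in Delphi (2010's generics and anonymous methods), with a deliberately naive O(n² log n) construction; the names `BuildSuffixArray` and `FindWithSuffixArray` are mine, not from any library, and a real implementation such as libdivsufsort builds the array far more efficiently:

```delphi
uses
  SysUtils, Generics.Collections, Generics.Defaults;

type
  TSuffixArray = array of Integer;

// Naive construction: sort all suffix start positions by comparing the
// suffixes lexicographically. Fine for demonstration; use a dedicated
// library for large texts.
function BuildSuffixArray(const Text: string): TSuffixArray;
var
  I: Integer;
begin
  SetLength(Result, Length(Text));
  for I := 0 to High(Result) do
    Result[I] := I + 1; // 1-based suffix start positions
  TArray.Sort<Integer>(Result, TComparer<Integer>.Construct(
    function(const A, B: Integer): Integer
    begin
      Result := CompareStr(Copy(Text, A, MaxInt), Copy(Text, B, MaxInt));
    end));
end;

// Binary search over the sorted suffixes: returns the position of one
// occurrence of Pattern in Text, or 0 if it does not occur.
function FindWithSuffixArray(const Text, Pattern: string;
  const SA: TSuffixArray): Integer;
var
  Lo, Hi, Mid, C: Integer;
begin
  Lo := 0;
  Hi := High(SA);
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    // Compare only the pattern-length prefix of the suffix.
    C := CompareStr(Copy(Text, SA[Mid], Length(Pattern)), Pattern);
    if C < 0 then
      Lo := Mid + 1
    else if C > 0 then
      Hi := Mid - 1
    else
      Exit(SA[Mid]); // match found
  end;
  Result := 0;
end;
```

The payoff is that after the one-time construction, each lookup costs only O(m log n) comparisons, regardless of how many searches you run against the same text.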

Single searches

If you don't need to search repeatedly, a brute-force string search algorithm can be efficient, for instance the assembly-optimised FastCode versions of Pos and friends. These were, however, written before Delphi was Unicode-ified, and I know of no similarly optimised Unicode implementations. If I were to write one today and wanted to optimise it for a modern processor (one capable of the SSE4.2 instruction set), I would have a serious look at the PCMPESTRI assembly instruction (reference pdf here; see also e.g. here, but I have no idea whether that code works), which can handle the 2-byte characters you'd need for Unicode string searching.
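As a reference point, a plain (non-SSE) brute-force UnicodeString search looks like this; `BruteForcePos` is a name I made up, and the FastCode/SSE4.2 versions perform essentially the same comparisons, just unrolled and vectorised in hand-tuned assembly:

```delphi
// Brute-force search over WideChars: returns the 1-based index of the
// first occurrence of SubStr in S at or after Offset, or 0 if absent.
function BruteForcePos(const SubStr, S: string; Offset: Integer = 1): Integer;
var
  I, J: Integer;
begin
  Result := 0;
  if (SubStr = '') or (Offset < 1) then
    Exit;
  for I := Offset to Length(S) - Length(SubStr) + 1 do
  begin
    J := 0;
    while (J < Length(SubStr)) and (S[I + J] = SubStr[1 + J]) do
      Inc(J);
    if J = Length(SubStr) then
      Exit(I); // full match at position I
  end;
end;
```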

PhiS
  • +1 because you mention the BWT solution, which is the best I know of (there are several variations, but they all have more or less the same premises). And the SSE4.2 solution is always tempting. – Arnaud Bouchez Jul 20 '11 at 08:56

Judging from its implementation, SearchBuf is slower than Pos/PosEx. But it does have other options, like whole-word search and case-insensitive search.
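A sketch of that, assuming the Delphi 2009+ PWideChar overload of SearchBuf (double-check the exact signature in your RTL version, as it has changed over the years):

```delphi
uses
  SysUtils, StrUtils;

// Returns the 1-based position of Word as a whole word in Text, or 0.
function FindWholeWord(const Text, Word: string): Integer;
var
  P: PChar;
begin
  // SearchBuf returns a pointer into the buffer, or nil when not found;
  // [soDown] searches forward, soWholeWord / soMatchCase are optional.
  P := SearchBuf(PChar(Text), Length(Text), 0, 0, Word,
    [soDown, soWholeWord]);
  if P = nil then
    Result := 0
  else
    Result := (P - PChar(Text)) + 1; // pointer back to 1-based index
end;
```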

For your purpose, the UnicodeString version of Pos is IMHO slower than PosEx (Pos uses the slow rep scasw asm instruction instead of PosEx's two-WideChar-unrolled compare, at least in Delphi 2010). And since I guess you'd like to have a start offset for your search (creating a sub-string just to call Pos is dead slow), you'll want to use PosEx.
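For instance, finding every occurrence boils down to feeding each match position back in as the next offset, with no sub-string copies involved:

```delphi
uses
  StrUtils;

// Writes the 1-based position of every (possibly overlapping)
// occurrence of SubStr in S.
procedure FindAllOccurrences(const SubStr, S: string);
var
  P: Integer;
begin
  P := PosEx(SubStr, S, 1);
  while P > 0 do
  begin
    Writeln('match at ', P);
    P := PosEx(SubStr, S, P + 1); // resume just past the last hit
  end;
end;
```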

A Boyer-Moore algorithm could probably be faster. You have to try it in your application, on real data, to see whether it's worth it.

And if you want to search English text, storing it as UTF-8 is worth a try: the search will be faster, since you'll use less memory.
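A minimal sketch of that idea; `BytePos` is my own helper, and byte-level matching is safe here because UTF-8's encoding guarantees that a match of a valid UTF-8 pattern always falls on character boundaries:

```delphi
uses
  SysUtils;

// 0-based brute-force search over raw bytes; returns -1 when absent.
function BytePos(const Pattern, Data: TBytes): Integer;
var
  I, J: Integer;
begin
  for I := 0 to Length(Data) - Length(Pattern) do
  begin
    J := 0;
    while (J < Length(Pattern)) and (Data[I + J] = Pattern[J]) do
      Inc(J);
    if J = Length(Pattern) then
      Exit(I);
  end;
  Result := -1;
end;

procedure Demo;
var
  Text, Needle: TBytes;
begin
  // Encode the big text once; for English data this takes half the
  // memory of UTF-16, which means fewer cache misses while scanning.
  Text := TEncoding.UTF8.GetBytes('The quick brown fox');
  Needle := TEncoding.UTF8.GetBytes('brown');
  Writeln(BytePos(Needle, Text)); // prints 10
end;
```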

But I guess that the main bottleneck of your application will be in the storage/retrieval part (i.e. disk access or un/compression), not in the search part. Use a software profiler to find out what is worth enhancing:

  • Make it right before you make it fast. Make it clear before you make it faster. Keep it right when you make it faster. — Kernighan and Plauger, Elements of Programming Style.
  • Premature optimization is the root of all evil. — Donald Knuth, quoting C. A. R. Hoare
  • The key to performance is elegance, not battalions of special cases. The terrible temptation to tweak should be resisted. — Jon Bentley and Doug McIlroy
  • The rules boil down to: "1. Don't optimize early. 2. Don't optimize until you know that it's needed. 3. Even then, don't optimize until you know what's needed, and where." — Herb Sutter
Arnaud Bouchez
  • Just as a side note: I only ever use AnsiString searches, but for these, typically (not always) FastCode brute-force beats Boyer-Moore searches in my tests. – PhiS Jul 20 '11 at 08:31
  • @PhiS It's because the FastCode `PosEx()` version using AnsiString is faster (even the one used in Delphi 2010): it uses a DWORD-aligned search, whereas the UnicodeString version as implemented in Delphi 2010 still reads `WideChar`s. That's why I wrote about using AnsiString if possible. And memory/disk usage will be much lower (half the size), therefore faster. For non-English text (and even worse for e.g. Chinese), AnsiString will be a slower solution than UnicodeString, because MBCS has a cost. – Arnaud Bouchez Jul 20 '11 at 08:46

For some cases, one kind of fast and efficient search algorithm is one that (if the data being searched does not change, and you search again and again over the same buffer content) searches once and builds some kind of lookup table. One possible solution is Boyer-Moore (see the link in Rudy's comment), but it didn't work great for really high-range Unicode characters (say, the range $0000..$FFFF): it was still faster than repeated Pos or SearchBuf calls, but it consumed a LOT of memory when I tested it with truly high-range Unicode data sets (Chinese text, for instance). My opinion is that there is no "slam dunk" solution.

Perhaps you could write a faster-than-Pos-but-not-so-much-memory-as-BoyerMoore solution like this (a sketch follows the list):

  1. Build a table over character code points that stores the locations where each character appears. I would use a sparse sorted array: for each search you have done, store every starting location of that initial character. That will spare you the brute-force scan through the big Unicode string looking for initial-character matches.

  2. If the table contains NO entry for a particular character, you have to do a brute-force search (just once) to build that data set. If you search once by brute force and code point U+1234 does not appear, then the entry for U+1234 should be an empty array [], indicating that you don't have to search again and can quickly fail any search whose initial character never occurs.

  3. Once you have found a non-empty list of initial-character positions, like [123, 456, 789], you have initial code point matches; continue with a character-by-character check at string[123..x], then string[456..x], and so on until you run out of elements in the array. Any mismatch skips to the next element in the lookup table.
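A minimal Delphi sketch of the scheme above (the class and method names are mine; Delphi 2010's Generics.Collections supplies the dictionary and lists):

```delphi
uses
  SysUtils, Generics.Collections;

type
  // Maps each initial character to the 1-based positions where it occurs.
  // An entry holding an empty list means "searched before, never occurs"
  // (step 2), so those searches fail immediately.
  TFirstCharIndex = class
  private
    FText: string;
    FMap: TObjectDictionary<Char, TList<Integer>>;
    function PositionsOf(C: Char): TList<Integer>;
  public
    constructor Create(const AText: string);
    destructor Destroy; override;
    function Find(const Pattern: string): Integer; // 0 = not found
  end;

constructor TFirstCharIndex.Create(const AText: string);
begin
  FText := AText;
  FMap := TObjectDictionary<Char, TList<Integer>>.Create([doOwnsValues]);
end;

destructor TFirstCharIndex.Destroy;
begin
  FMap.Free;
  inherited;
end;

// Steps 1 and 2: build the position list for C lazily, on first use.
function TFirstCharIndex.PositionsOf(C: Char): TList<Integer>;
var
  I: Integer;
begin
  if not FMap.TryGetValue(C, Result) then
  begin
    Result := TList<Integer>.Create;
    for I := 1 to Length(FText) do
      if FText[I] = C then
        Result.Add(I);
    FMap.Add(C, Result); // may be empty: "known absent"
  end;
end;

// Step 3: verify the rest of the pattern at each candidate position.
function TFirstCharIndex.Find(const Pattern: string): Integer;
var
  Start, J: Integer;
begin
  Result := 0;
  if Pattern = '' then
    Exit;
  for Start in PositionsOf(Pattern[1]) do
  begin
    if Start + Length(Pattern) - 1 > Length(FText) then
      Break; // positions ascend; the rest are too near the end
    J := 2;
    while (J <= Length(Pattern)) and (FText[Start + J - 1] = Pattern[J]) do
      Inc(J);
    if J > Length(Pattern) then
      Exit(Start);
  end;
end;
```

The table only pays for itself when the same text is searched many times, which is exactly the caveat below.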

Again, there will be pathological cases where all this extra work results in code that is slower than Pos, and unless you can be sure you need to search for a lot of different substrings in exactly the same big string, all this "optimization" stuff is a waste of time; forget it. Even Boyer-Moore string searches are slower when you can't reuse the lookup data table multiple times, at least.

If all you want is a Pos() function that is hand-optimized in assembly to work as fast as possible within the confines of a library function that doesn't produce or consume any intermediate table storage, then congratulations: Pos() is probably already about as fast as you can get (I believe it was optimized by the FastCode people a few years ago).

Warren P
  • Your algorithm is intriguing... But I'm not sure that it will be faster. You are using a lot of branches and small allocations (nested lists) in it, whereas a plain brute-force unrolled search version will probably be faster on modern CPUs, with multiple pipelines. I'm not convinced the memory consumption of nested lists is a good speed solution. We'll have to find out on real data... Did you implement and test it? The BWT algorithm (from PhiS's answer) is a more proven algorithm. – Arnaud Bouchez Jul 20 '11 at 08:54
  • Yeah, I'm interested in a more formal algorithm description too. – Premature Optimization Jul 20 '11 at 16:05
  • Essentially the payoff in this algorithm is ONLY applicable when you have a large static buffer, and you can avoid a brute force search by essentially building this pre-search table. It has worked for me in the case of text buffers (text editor programs, for instance) that change infrequently. It is probably not useful when the strings involved are short, and the searches:changes ratio is less than 10. – Warren P Jul 20 '11 at 16:47
  • Essentially nothing is ever inherently ALWAYS FASTER; it's just possible that you can get faster MEASUREMENTS in SOME CASES. Even statements about fast CPUs and big caches and multiple pipelines have enough dependencies and small print attached that the short answer is: there is no certain static analysis; the only answer that means anything is to test and measure. – Warren P Jul 20 '11 at 16:49