
What libraries provide case-insensitive, exact substring matching in Node.js against a large corpus of strings? I'm specifically looking for index-based solutions.

As an example, consider a corpus consisting of millions of strings:

  • "Abc Gef gHi"
  • "Def Ghi xYz"

I need a library such that a search for "C ge" returns the first string above, but a search for "C  ge" (note the two spaces) does not. In other words, I'm not looking for fuzzy, intelligent full-text search with stemming and stop words; rather, I want the simplest (and fastest) exact substring matcher with an index that works at a large scale.
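To make the requirement concrete, here is the intended behavior as a plain, non-indexed substring check (the `matches` helper is mine, purely for illustration; a real solution would replace this linear scan with an index):

```javascript
// Case-insensitive exact substring match: normalize case only;
// whitespace is significant and nothing is stemmed or tokenized.
function matches(needle, haystack) {
  return haystack.toLowerCase().includes(needle.toLowerCase());
}

const corpus = ['Abc Gef gHi', 'Def Ghi xYz'];
console.log(corpus.filter(s => matches('C ge', s)));  // → [ 'Abc Gef gHi' ]
console.log(corpus.filter(s => matches('C  ge', s))); // → []
```

This defines the match semantics I want, but it is O(n) per query over the whole corpus, which is exactly what the index should avoid.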

Solutions in JavaScript are welcome, as are solutions in C (since they can be turned into a native Node.js module). Alternatively, solutions in other programming languages such as Java are also possible; they can be used through the command line. Preferably, solutions are disk-space-bound rather than memory-bound (so preferably not Redis, for instance), and they should write their index to disk so that subsequent startup time is low.

The problem with most solutions I found (such as the ones here) is that they are too intelligent. That is, they apply different kinds of stemming or normalization, so the matches are not exact.

Thanks in advance for your help!

Ruben Verborgh

3 Answers


I'll list some of the solutions I found.

The simplest but fitting one would be JsSuffixTrie: https://github.com/martijnversluis/JsSuffixTrie

Then, more elaborate and hash-based, there is search-index: https://github.com/fergiemcdowall/search-index

I can also suggest http://redis.io/. It's advanced, but still quite low-level, without much fancy packaging.

Finally, this blog post discusses tries in JavaScript, where the problem seems to be mostly loading time: http://ejohn.org/blog/javascript-trie-performance-analysis/
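A suffix trie fits this problem because every substring of a string is a prefix of one of its suffixes, so a substring lookup becomes a prefix walk from the root. Here is a minimal sketch of that idea (this is not JsSuffixTrie's actual API, just an illustration of the underlying structure):

```javascript
// Naive suffix trie: index every suffix of every string, tagging
// each node with the ids of the strings whose suffixes pass through it.
// A substring query then walks the trie once, in O(query length).
class SuffixTrie {
  constructor() { this.root = {}; }
  add(str, id) {
    const s = str.toLowerCase();
    for (let i = 0; i < s.length; i++) {
      let node = this.root;
      for (const ch of s.slice(i)) {
        node = node[ch] || (node[ch] = {});
        (node.ids || (node.ids = new Set())).add(id);
      }
    }
  }
  find(substring) {
    let node = this.root;
    for (const ch of substring.toLowerCase()) {
      node = node[ch];
      if (!node) return new Set();  // no string contains this substring
    }
    return node.ids || new Set();
  }
}

const trie = new SuffixTrie();
trie.add('Abc Gef gHi', 0);
trie.add('Def Ghi xYz', 1);
console.log(trie.find('C ge'));   // → Set { 0 }
console.log(trie.find('C  ge'));  // → Set {}
```

Note that a naive suffix trie like this is very memory-hungry, which relates to the loading-time problem the blog post describes; compressed suffix trees or suffix arrays are the usual way around that.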

Mielvds

Off the top of my head, I can think of two possible solutions.

One is to use case-insensitive regex matching, with the string you search for (e.g. "C ge") as the regex.

Another is to store an all-lowercase (or all-uppercase) copy of every string and search those copies, while returning the unmodified original. Of course, the search string needs to be lowercased (or uppercased) as well for this to work.
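Both ideas in a few lines (the `escapeRegex` helper is mine; it is needed so that metacharacters in the query are matched literally rather than as a pattern):

```javascript
const corpus = ['Abc Gef gHi', 'Def Ghi xYz'];
const query = 'C ge';

// Approach 1: case-insensitive regex, with the query escaped so it
// is treated as a literal string, not as a pattern.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}
const re = new RegExp(escapeRegex(query), 'i');
const viaRegex = corpus.filter(s => re.test(s));

// Approach 2: keep a lowercased shadow copy, search that with
// indexOf, and return the original (unmodified) strings.
const lowered = corpus.map(s => s.toLowerCase());
const q = query.toLowerCase();
const viaCopy = corpus.filter((s, i) => lowered[i].indexOf(q) !== -1);

console.log(viaRegex); // → [ 'Abc Gef gHi' ]
console.log(viaCopy);  // → [ 'Abc Gef gHi' ]
```

Either way, the match itself is exact and case-insensitive; the lowercasing is done once per corpus string, not per query.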

Some programmer dude
  • 400,186
  • 35
  • 402
  • 621
  • Hi Joachim, I'm afraid that such a solution would not be fast enough. Executing a regex (or exact substring search in case of normalization) on a million strings is not scalable. I really need an index-based solution, for instance, using n-grams or something else. My question is whether there are libraries that do this (I will clarify this more). – Ruben Verborgh Feb 05 '15 at 12:17
  • @RubenVerborgh Regarding the normalization, you only need to do it *once*. Have two collections, one with the original string and one with the normalized string, where the indexes of both in the collections are the same. If the substring you search for is small, the overhead of normalizing that one is most likely negligible. – Some programmer dude Feb 05 '15 at 12:22
  • Sure, normalization doesn't introduce a lot of overhead; and that's fine. It's the search strategy I'm worried about. With your proposal, millions of strings need to be checked at every search; even if it's only exact comparison, that's still O(n) with n the number of strings in the corpus. – Ruben Verborgh Feb 05 '15 at 12:37
  • @RubenVerborgh If you have a lot of memory, you could use multiple [tries](https://en.wikipedia.org/wiki/Trie)? Normalized of course. :) – Some programmer dude Feb 05 '15 at 12:52
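The n-gram index mentioned in the comments could look roughly like this (a sketch of the technique, not a library recommendation): index all trigrams of each lowercased string, intersect the posting lists of the query's trigrams to get candidates, then verify candidates with an exact substring check.

```javascript
// Trigram inverted index: maps each 3-gram to the set of string ids
// containing it. Queries intersect posting lists, then verify,
// so only a few candidates are scanned instead of the whole corpus.
class TrigramIndex {
  constructor() { this.postings = new Map(); this.docs = []; }
  add(str) {
    const id = this.docs.push(str) - 1;
    const s = str.toLowerCase();
    for (let i = 0; i + 3 <= s.length; i++) {
      const g = s.slice(i, i + 3);
      if (!this.postings.has(g)) this.postings.set(g, new Set());
      this.postings.get(g).add(id);
    }
  }
  search(query) {
    const q = query.toLowerCase();
    if (q.length < 3)  // too short for trigrams: fall back to a scan
      return this.docs.filter(d => d.toLowerCase().includes(q));
    let candidates = null;
    for (let i = 0; i + 3 <= q.length; i++) {
      const ids = this.postings.get(q.slice(i, i + 3));
      if (!ids) return [];  // some trigram occurs nowhere
      candidates = candidates
        ? new Set([...candidates].filter(id => ids.has(id)))
        : new Set(ids);
    }
    // Containing all trigrams is necessary but not sufficient,
    // so verify each candidate with an exact substring check.
    return [...candidates]
      .map(id => this.docs[id])
      .filter(d => d.toLowerCase().includes(q));
  }
}

const idx = new TrigramIndex();
idx.add('Abc Gef gHi');
idx.add('Def Ghi xYz');
console.log(idx.search('C ge'));   // → [ 'Abc Gef gHi' ]
console.log(idx.search('C  ge'));  // → []
```

The posting lists could be serialized to disk to satisfy the low-startup-time requirement; this sketch keeps everything in memory for brevity.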

It depends, of course, on the size of your dataset and your minimum response times.

For many use cases, standard Unix tools such as sed and grep are pretty much unbeatable when it comes to pattern matching.
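For example, with one corpus string per line, grep's `-F` (fixed string) and `-i` (ignore case) flags give exactly the semantics asked for:

```shell
# Build a one-string-per-line corpus file.
printf '%s\n' 'Abc Gef gHi' 'Def Ghi xYz' > corpus.txt

# -F treats the query as a literal string (no regex metacharacters),
# -i makes the match case-insensitive; whitespace stays significant.
grep -iF 'C ge' corpus.txt    # prints: Abc Gef gHi
grep -iF 'C  ge' corpus.txt   # prints nothing (exit status 1)
```

Note that grep scans the whole file on every query, so like the regex approach above it is O(n) per search without an on-disk index.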

Fergie
  • 5,933
  • 7
  • 38
  • 42