
Given a city:

public class City
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Country { get; set; }
    public LatLong Location { get; set; }
}

I have a list of close to 3,000,000 cities (and towns and villages etc.) in a file. This file is read into memory; I have been playing with arrays, lists, dictionaries (key = Id) etc.

I want to find, as quickly as possible, all cities matching a substring (case-insensitive). So when I search for 'yor' I want to get all matches (1000+) ASAP (matching 'York Town', 'Villa Mayor', 'New York', ...).

Functionally you could write this as:

cities.Values.Where(c => c.Name.IndexOf("yor", StringComparison.OrdinalIgnoreCase) >= 0)

I don't mind doing some pre-processing when reading the file; as a matter of fact: that's what I'm mostly looking for. Read the file, "chew" on the data creating some sort of index or... and then be ready to answer queries like "yor".

I want this to be standalone, self-contained. I do not want to add dependencies like an RDBMS, ElasticSearch or whatever. I don't mind having (parts of) the list in memory more than once. I don't mind spending some memory on a datastructure to help me find my results quickly. I don't want libraries or packages. I want an algorithm I can implement myself.

Basically I want the above LINQ statement, but optimized for my case; currently, plowing through almost 3,000,000 records takes roughly 2 seconds. I want this sub-0.1 second so I can use the search and its results as 'autocomplete'.

Creating an "index"(-alike) structure is probably what I need. As I'm writing this I remember something about a "bloom filter", but I'm not sure whether that would help or even support substring search. Will look into that now.

Any tips, pointers, help very much appreciated.

  • Maybe a Boyer-Moore implementation? https://github.com/likejazz/boyer-moore-string-search – Jonathan Hamel Aug 27 '18 at 14:26
  • @JonathanHamel I'm assuming `IndexOf` does a Boyer-Moore or similar. My problem is that even when using a Boyer-Moore (alike) I'd have to go over all 3M strings. I was hoping to reduce the search-space drastically with some data-structure / algorithm. If I had the luxury of a "Starts-with" search I could implement a binary-search. But I want to match on strings anywhere in the target strings ("contains" search). I'm happy to "pay" for that upfront by creating some lookup or whatever after / during reading the file. As long as it improves my query response time. – user10280259 Aug 27 '18 at 14:30
  • You might want to construct a [trie](https://en.wikipedia.org/wiki/Trie) – juharr Aug 27 '18 at 14:34
  • If you don't mind pummeling your memory, why not generate every combination of 3 letters and index their position in the file? – Guillaume CR Aug 27 '18 at 14:34
  • Have you considered text-search libraries like https://lucenenet.apache.org/? – SaiBot Aug 27 '18 at 14:41
  • @SaiBot What wasn't clear about the _"I want this to be standalone, self-contained. I do not want to add dependencies like an RDBMS, ElasticSearch or whatever. [...] I want an algorithm I can implement myself."_ paragraph? – user10280259 Aug 27 '18 at 14:45
  • @juharr How would that support substring searches? – user10280259 Aug 27 '18 at 14:46
  • @GuillaumeCR That would be a (very expensive) option but would surely help. I would have a list of IDs for every three-letter combination in memory with matches and expand the search from there with simpler algorithms (hell, maybe even linear from there on). I haven't tested it but I'm afraid the memory usage would explode though (and yes, I said I don't mind spending some memory, but I'm afraid this would be a bit too much; I'll have to look into it). – user10280259 Aug 27 '18 at 14:48
  • @user10280259 sorry, I somehow missed that part. In this case you want to implement one of the [existing substring indexes](https://en.wikipedia.org/wiki/Substring_index) – SaiBot Aug 27 '18 at 14:49
  • @SaiBot Again; I would still have to iterate over 3,000,000 cities for matches (unless I'm understanding you wrong). I'm looking to reduce the search-space before even thinking about optimizing substring searches for single strings. Also; no need to say sorry. Your help is appreciated anyway! – user10280259 Aug 27 '18 at 15:01
  • @user10280259 I am not talking about optimizing substring search within a city name. I am talking about indexing all city names in a substring index. For example you could concatenate all city names (with a unique separator symbol) and then construct a suffix array. Or you construct an N-gram index where each N-gram points to all city names where the N-gram is contained. – SaiBot Aug 27 '18 at 15:06
  • Some quick napkin math shows that you could have a dictionary with 26^3 = 17576 entries. The vast majority of those entries will have an empty list, with some entries having possibly hundreds of thousands of pointers. We're talking about a few gigs here, this is entirely possible. You can drastically reduce the size of each bucket by going to 4 characters too. – Guillaume CR Aug 27 '18 at 15:10
  • @SaiBot I'll look into it, but I don't see how a big string of all concatenated city names would help. I would surely be able to quickly tell IF a string is somewhere in the 3M cities, but how would I get back to the actual City objects? Where would the IDs be stored to be related to the city name in that blob of text? – user10280259 Aug 27 '18 at 15:13
  • @GuillaumeCR I'm currently looking into it. One of your assumptions (26^3), however, is wrong. I'm not dealing with a western alphabet here but UTF-8 strings containing Chinese, Korean, Swahili and what have you. However, that would only be an advantage, creating more (but smaller!) buckets overall I think. Edit: Here are [all 332 first letters](https://pastebin.com/EGq9rm6z) (though I do notice a bit of 'garbage' like `’` and `;`). "Cities" are pulled from (a pre-filtered) [geonames](http://download.geonames.org/export/dump/)' allcountries.txt. – user10280259 Aug 27 '18 at 15:21
  • Well that definitely complicates things. However you can be smart about building your index, as for example it's unlikely you'll have cities with both Chinese and Latin characters. Use this knowledge to reduce the size of the index, and I think you're in business. – Guillaume CR Aug 27 '18 at 15:26
  • @user10280259 A suffix array will give you the positions of hits in your large file. To get the corresponding id you can either put it in front of each city (assuming city names do not contain numbers) or create an additional structure that gives you an id for each position in your large file (e.g., binary search on the start positions of the city names) – SaiBot Aug 27 '18 at 15:31
  • 2sec, debug or release? https://stackoverflow.com/questions/10519380/how-can-cs-string-indexof-perform-so-fast-10-times-faster-than-ordinary-for-l – Aldert Aug 27 '18 at 16:07
  • Create a hashmap of vectors. Go over each sentence from the dataset and for every 3-letter subsequence, add its index to the corresponding vector hashmap[xyz]. As for space, it shouldn't be much more than what you are already using to hold all the words – juvian Aug 27 '18 at 17:34
  • @SaiBot I have [added an answer](https://stackoverflow.com/a/52057320/10280259) with a 'suffix array'-alike solution. Since you were the first to mention it but haven't posted an answer I will accept my own answer tomorrow (I can't currently) but I'll leave the credits in. Thanks! – user10280259 Aug 28 '18 at 12:39
  • @GuillaumeCR Now that I have implemented it: FYI; your estimate of 26^3 entries (= 17576) was off by about a factor of 4.3. There are 76172 entries in the dictionary. Just for shits 'n' giggles; you didn't know at the time I was dealing with more than the western alphabet so you **couldn't** make a realistic estimate. I just thought you might want to know. Thanks anyway! – user10280259 Aug 28 '18 at 12:46

3 Answers

2

I created a bit of a hybrid based on a suffix array / dictionary. Thanks to SaiBot for suggesting it first, and to all the other people who helped with suggestions.

This is what I came up with:

public class CitiesCollection
{
    private Dictionary<int, City> _cities;
    private SuffixDict<int> _suffixdict;

    public CitiesCollection(IEnumerable<City> cities, int minLen)
    {
        _cities = cities.ToDictionary(c => c.Id);
        _suffixdict = new SuffixDict<int>(minLen, _cities.Values.Count);
        foreach (var c in _cities.Values)
            _suffixdict.Add(c.Name, c.Id);
    }

    public IEnumerable<City> Find(string find)
    {
        var normalizedFind = _suffixdict.NormalizeString(find);
        // The n-gram dictionary produces candidate ids; IndexOf then verifies
        // the actual substring match to filter out false positives.
        foreach (var id in _suffixdict.Get(normalizedFind).Where(v => _cities[v].Name.IndexOf(normalizedFind, StringComparison.OrdinalIgnoreCase) >= 0))
            yield return _cities[id];
    }
}


public class SuffixDict<T>
{
    private readonly int _suffixsize;
    private ConcurrentDictionary<string, IList<T>> _dict;

    public SuffixDict(int suffixSize, int capacity)
    {
        _suffixsize = suffixSize;
        _dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
    }

    public void Add(string suffix, T value)
    {
        foreach (var s in GetSuffixes(suffix))
            AddDict(s, value);
    }

    public IEnumerable<T> Get(string suffix)
    {
        return Find(suffix).Distinct();
    }

    private IEnumerable<T> Find(string suffix)
    {
        foreach (var s in GetSuffixes(suffix))
        {
            if (_dict.TryGetValue(s, out var result))
                foreach (var i in result)
                    yield return i;
        }
    }

    public string NormalizeString(string value)
    {
        return value.Normalize().ToLowerInvariant();
    }

    private void AddDict(string suffix, T value)
    {
        // Note: mutating the List inside AddOrUpdate is not thread-safe in general;
        // it is fine here because the load happens on a single thread.
        _dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
    }

    private IEnumerable<string> GetSuffixes(string value)
    {
        var nv = NormalizeString(value);
        // Note: strings shorter than _suffixsize yield no n-grams at all,
        // so very short names/queries fall through; the revised version below handles this.
        for (var i = 0; i <= nv.Length - _suffixsize; i++)
            yield return nv.Substring(i, _suffixsize);
    }
}
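
For example, with a suffix size of 3 the normalized name "new york" is indexed under six 3-grams ("new", "ew ", "w y", " yo", "yor", "ork"); a query for "york" then unions the buckets for its own grams "yor" and "ork" and verifies each candidate with IndexOf.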

Usage (where I assume `mycities` to be an `IEnumerable<City>` with the given City object from the question):

var cc = new CitiesCollection(mycities, 3);
var results = cc.Find("york");

Some results:

Find: sterda elapsed: 00:00:00.0220522 results: 32
Find: york   elapsed: 00:00:00.0006212 results: 155
Find: dorf   elapsed: 00:00:00.0086439 results: 6095

Memory usage is very, very acceptable: only 650MB total with the entire collection of 3,000,000 cities in memory.

In the above I'm storing Ids in the "SuffixDict", which adds a level of indirection (dictionary lookups to resolve id => city). This can be further simplified to:

public class CitiesCollection
{
    private SuffixDict<City> _suffixdict;

    public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
    {
        _suffixdict = new SuffixDict<City>(minLen, capacity);
        foreach (var c in cities)
            _suffixdict.Add(c.Name, c);
    }

    public IEnumerable<City> Find(string find, StringComparison stringComparison = StringComparison.OrdinalIgnoreCase)
    {
        var normalizedFind = SuffixDict<City>.NormalizeString(find);
        foreach (var city in _suffixdict.Find(normalizedFind).Where(v => v.Name.IndexOf(normalizedFind, stringComparison) >= 0))
            yield return city;
    }
}

public class SuffixDict<T>
{
    private readonly int _suffixsize;
    private ConcurrentDictionary<string, IList<T>> _dict;

    public SuffixDict(int suffixSize, int capacity = 1000)
    {
        _suffixsize = suffixSize;
        _dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity);
    }

    public void Add(string suffix, T value)
    {
        foreach (var s in GetSuffixes(suffix, _suffixsize))
            AddDict(s, value);
    }

    public IEnumerable<T> Find(string suffix)
    {
        var normalizedfind = NormalizeString(suffix);
        var find = normalizedfind.Substring(0, Math.Min(normalizedfind.Length, _suffixsize));

        if (_dict.TryGetValue(find, out var result))
            foreach (var i in result)
                yield return i;
    }

    private void AddDict(string suffix, T value)
    {
        _dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
    }

    public static string NormalizeString(string value)
    {
        return value.Normalize().ToLowerInvariant();
    }

    private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
    {
        var nv = NormalizeString(value);
        // Names shorter than the n-gram size are indexed whole.
        if (nv.Length < suffixSize)
        {
            yield return nv;
        }
        else
        {
            for (var i = 0; i <= nv.Length - suffixSize; i++)
                yield return nv.Substring(i, suffixSize);
        }
    }
    }
}

This bumps the load time up from 00:00:16.3899085 to 00:00:25.6113214, but memory usage goes down from 650MB to 486MB. Lookups/searches perform a bit better since we have one less level of indirection.

Find: sterda elapsed: 00:00:00.0168616 results: 32
Find: york elapsed: 00:00:00.0003945 results: 155
Find: dorf elapsed: 00:00:00.0062015 results: 6095

I'm happy with the results so far. I'll do a little polishing and refactoring and call it a day! Thanks everybody for the help!

And this is how it performs with 2,972,036 cities:

[screenshot of the results]

This has evolved into a case-insensitive, accent-insensitive search by modifying the code to this:

public static class ExtensionMethods
{
    public static T FirstOrDefault<T>(this IEnumerable<T> src, Func<T, bool> testFn, T defval)
    {
        return src.Where(aT => testFn(aT)).DefaultIfEmpty(defval).First();
    }

    public static int IndexOf(this string source, string match, IEqualityComparer<string> sc)
    {
        return Enumerable.Range(0, source.Length) // for each position in the string
                         .FirstOrDefault(i => // find the first position where either
                                              // match is Equals at this position for length of match (or to end of string) or
                             sc.Equals(source.Substring(i, Math.Min(match.Length, source.Length - i)), match) ||
                             // match is Equals to one of the substrings beginning at this position
                             Enumerable.Range(1, source.Length - i - 1).Any(ml => sc.Equals(source.Substring(i, ml), match)),
                             -1 // else return -1 if no position matches
                          );
    }
}
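
As an aside: the framework can do culture-aware, accent-insensitive searching natively via `CompareInfo.IndexOf` with `CompareOptions`. A minimal sketch (the helper name is mine, and I haven't benchmarked it against the extension method above, which is quadratic in the worst case):

using System.Globalization;

public static class InsensitiveSearch
{
    private static readonly CompareInfo _compareInfo = CultureInfo.InvariantCulture.CompareInfo;

    // Index of 'match' in 'source', ignoring case and diacritics; -1 if absent.
    public static int IndexOfInsensitive(string source, string match)
    {
        return _compareInfo.IndexOf(source, match,
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace);
    }
}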

public class CaseAccentInsensitiveEqualityComparer : IEqualityComparer<string>
{
    private static readonly CompareOptions _compareoptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth | CompareOptions.IgnoreSymbols;
    private static readonly CultureInfo _cultureinfo = CultureInfo.InvariantCulture;
    public bool Equals(string x, string y)
    {
        return string.Compare(x, y, _cultureinfo, _compareoptions) == 0;
    }

    public int GetHashCode(string obj)
    {
        return obj != null ? RemoveDiacritics(obj).ToUpperInvariant().GetHashCode() : 0;
    }

    private string RemoveDiacritics(string text)
    {
        return string.Concat(
            text.Normalize(NormalizationForm.FormD)
            .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
        ).Normalize(NormalizationForm.FormC);
    }
}
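
A quick sanity check of what the comparer buys us (illustrative snippet; exact behaviour can differ slightly between the NLS back-end on .NET Framework and ICU on newer runtimes):

var comparer = new CaseAccentInsensitiveEqualityComparer();
Console.WriteLine(comparer.Equals("Düsseldorf", "dusseldorf")); // True: case and diacritics are ignored
Console.WriteLine(comparer.Equals("México", "MEXICO"));         // True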

public class CitiesCollection
{
    private SuffixDict<City> _suffixdict;
    private HashSet<string> _countries;
    private Dictionary<int, City> _cities;
    private readonly IEqualityComparer<string> _comparer = new CaseAccentInsensitiveEqualityComparer();

    public CitiesCollection(IEnumerable<City> cities, int minLen, int capacity = 1000)
    {
        _suffixdict = new SuffixDict<City>(minLen, _comparer, capacity);
        _countries = new HashSet<string>();
        _cities = new Dictionary<int, City>(capacity);
        foreach (var c in cities)
        {
            _suffixdict.Add(c.Name, c);
            _countries.Add(c.Country);
            _cities.Add(c.Id, c);
        }
    }

    public City this[int index] => _cities[index];

    public IEnumerable<string> Countries => _countries;

    public IEnumerable<City> Find(string find)
    {
        // The custom comparer makes both the dictionary lookup and the
        // verification step case- and accent-insensitive.
        foreach (var city in _suffixdict.Find(find).Where(v => v.Name.IndexOf(find, _comparer) >= 0))
            yield return city;
    }
}

public class SuffixDict<T>
{
    private readonly int _suffixsize;
    private ConcurrentDictionary<string, IList<T>> _dict;

    public SuffixDict(int suffixSize, IEqualityComparer<string> stringComparer, int capacity = 1000)
    {
        _suffixsize = suffixSize;
        _dict = new ConcurrentDictionary<string, IList<T>>(Environment.ProcessorCount, capacity, stringComparer);
    }

    public void Add(string suffix, T value)
    {
        foreach (var s in GetSuffixes(suffix, _suffixsize))
            AddDict(s, value);
    }

    public IEnumerable<T> Find(string suffix)
    {
        var find = suffix.Substring(0, Math.Min(suffix.Length, _suffixsize));

        if (_dict.TryGetValue(find, out var result))
        {
            foreach (var i in result)
                yield return i;
        }
    }

    private void AddDict(string suffix, T value)
    {
        _dict.AddOrUpdate(suffix, (s) => new List<T>() { value }, (k, v) => { v.Add(value); return v; });
    }

    private static IEnumerable<string> GetSuffixes(string value, int suffixSize)
    {
        // Names shorter than the n-gram size are indexed whole.
        if (value.Length < suffixSize)
        {
            yield return value;
        }
        else
        {
            for (var i = 0; i <= value.Length - suffixSize; i++)
                yield return value.Substring(i, suffixSize);
        }
    }
}
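
Usage stays the same as before; the payoff is that a query typed without accents now finds accented names too (illustrative, assuming `mycities` as above):

var cc = new CitiesCollection(mycities, 3);
var results = cc.Find("dussel"); // should also match e.g. "Düsseldorf"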

With credit also to Netmage and Mitsugui. There are still some issues and edge cases, but it's continually improving!

1

You could use a suffix tree: https://en.wikipedia.org/wiki/Suffix_tree

It requires enough space to store roughly 20 times your list of words in memory.

A suffix array is a space-efficient alternative: https://en.wikipedia.org/wiki/Suffix_array
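
For illustration, here is a minimal, naively constructed sketch of the suffix-array route over the City class from the question: concatenate the lower-cased names with a separator, sort all suffix positions, binary-search the query, then map hits back to ids via the name start offsets (as SaiBot also outlined in the comments). A real implementation would replace the naive sort with a linear-time construction such as SA-IS:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public class SuffixArraySearch
{
    private readonly string _text;     // all names, lower-cased, joined by '\n'
    private readonly int[] _suffixes;  // suffix start positions, sorted lexicographically
    private readonly int[] _starts;    // start offset of each name in _text
    private readonly int[] _ids;       // city id for each entry in _starts

    public SuffixArraySearch(IEnumerable<City> cities)
    {
        var sb = new StringBuilder();
        var starts = new List<int>();
        var ids = new List<int>();
        foreach (var c in cities)
        {
            starts.Add(sb.Length);
            ids.Add(c.Id);
            sb.Append(c.Name.ToLowerInvariant()).Append('\n');
        }
        _text = sb.ToString();
        _starts = starts.ToArray();
        _ids = ids.ToArray();
        _suffixes = Enumerable.Range(0, _text.Length).ToArray();
        // Naive O(n log n) comparison sort; the '\n' separators keep matches
        // from spilling across name boundaries.
        Array.Sort(_suffixes, (a, b) => string.CompareOrdinal(_text, a, _text, b, _text.Length));
    }

    public IEnumerable<int> FindIds(string find)
    {
        var f = find.ToLowerInvariant();
        int lo = Bound(f, strict: false), hi = Bound(f, strict: true);
        for (var i = lo; i < hi; i++)
        {
            // Map the hit position back to the name containing it.
            var idx = Array.BinarySearch(_starts, _suffixes[i]);
            if (idx < 0) idx = ~idx - 1;
            yield return _ids[idx];
        }
    }

    // First suffix whose first f.Length chars are >= f (non-strict) or > f (strict);
    // the two bounds delimit the block of suffixes that start with f.
    private int Bound(string f, bool strict)
    {
        int lo = 0, hi = _suffixes.Length;
        while (lo < hi)
        {
            var mid = (lo + hi) / 2;
            var cmp = string.CompareOrdinal(_text, _suffixes[mid], f, 0, f.Length);
            if (cmp < 0 || (strict && cmp == 0)) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}

Note that several hits inside the same name yield duplicate ids, so a caller would apply Distinct() before resolving the ids back to City objects.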

– arenard
  • Thanks for your help! Please see [this comment](https://stackoverflow.com/questions/52041325/quick-substring-search-in-large-set-of-data#comment91066393_52041325) on why I won't accept your answer. I hope you understand! Your help and input is still appreciated though! – user10280259 Aug 28 '18 at 12:51
-6

In my benchmark a `Contains` query is much faster than `IndexOf(...) >= 0`:

cities.Values.Where(c => c.Name.Contains("yor"))
  • `Contains` _is_ `IndexOf` in disguise ([source](https://referencesource.microsoft.com/#mscorlib/system/string.cs,2173)) – FCin Aug 27 '18 at 14:40