3

I am somewhat struggling with the terminology and complexity of my explanation here, so feel free to edit it.

I have 1,000 to 20,000 objects. Each one can contain several name words (first, second, middle, last, title...) and normalized numbers (home, business...), email addresses, or even physical addresses and spouse names.

I want to implement a search that lets users freely combine word parts and number parts.
When I search for "LL 676", I want to find all objects in which both "LL" and "676" occur in some string.
Currently I iterate over every object and every object's property, split the search string on " ", and call stringInstance.Contains(searchWord).
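Roughly, the current approach is equivalent to this (a simplified sketch; representing each object as an array of its property strings is invented here for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of the current linear scan; each "object" is just
// an array of its property strings (invented for illustration).
static class NaiveSearch
{
    public static List<string[]> Scan(List<string[]> objects, string searchString)
    {
        string[] terms = searchString.ToUpper()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Keep an object only if every term occurs in at least one property.
        return objects
            .Where(props => terms.All(t =>
                props.Any(p => p.ToUpper().Contains(t))))
            .ToList();
    }
}
```

With up to 20,000 objects and several properties each, this does a full substring scan of everything on every search, which is why it gets slow.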

This is too slow, so I am looking for a better solution.

What is the appropriate language-agnostic data structure for this?
In my case I need it in C#.

Is the following data structure a good solution?


It's based on a HashMap/Dictionary.
First I build a string containing all the name parts and phone numbers I want to search through, for example: "William Bill Henry Gates III 3. +436760000 billgatesstreet 12".
Then I split on " " and, for every word x, I generate all substrings y that fulfill x.Contains(y). I put each of those substrings into the hashmap/dictionary.
On lookup/search I just run the lookup for every search word and intersect the results. Naturally, the lookup itself is blazingly fast (native HashMap/Dictionary speed).

EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to generate the substrings.
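As a minimal single-threaded sketch of that idea (all names invented; the full version is in my answer below):

```csharp
using System;
using System.Collections.Generic;

// Minimal single-threaded sketch of the substring index (names invented).
class SubstringIndex<T>
{
    private readonly Dictionary<string, HashSet<T>> map =
        new Dictionary<string, HashSet<T>>();

    public void Add(string text, T item)
    {
        foreach (string word in text.ToUpper().Split(' '))
        {
            // Index every substring of every word.
            for (int start = 0; start < word.Length; start++)
            {
                for (int len = 1; len <= word.Length - start; len++)
                {
                    string sub = word.Substring(start, len);
                    HashSet<T> set;
                    if (!map.TryGetValue(sub, out set))
                        map[sub] = set = new HashSet<T>();
                    set.Add(item);
                }
            }
        }
    }

    public HashSet<T> Search(string query)
    {
        HashSet<T> result = null;
        foreach (string word in query.ToUpper().Split(' '))
        {
            if (word.Length == 0)
                continue;

            HashSet<T> hits;
            if (!map.TryGetValue(word, out hits))
                return new HashSet<T>();       // one unmatched term => empty

            if (result == null)
                result = new HashSet<T>(hits); // seed with the first term
            else
                result.IntersectWith(hits);    // AND semantics
        }
        return result ?? new HashSet<T>();
    }
}
```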

ASA
  • 1,911
  • 3
  • 20
  • 37
  • You write: any string with "LL" AND "676". Do you actually mean 'and' there, or 'or'? – Ben Aaronson Mar 15 '14 at 15:17
  • In my case I mean "and": Say you are looking for someone from company "Microsoft" and you know some other fragments of that person's data. Then you can filter for Microsoft employees just by entering "microsoft" in the search box and further reduce the result set with fragments you know. My search isn't for users who want to explore but for users who somewhat know what they are looking for. If you are interested in the solution I chose, take a look at the data structure I posted below. I fixed the exponential slowdown problem. Inserts are now incredibly fast even for very long strings. – ASA Mar 15 '14 at 16:34

2 Answers

1

It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:

foreach (string arg in searchWords)
{
    if (String.IsNullOrEmpty(arg))
        continue;

    tempList = new List<T>();

    if (dictionary.ContainsKey(arg))
        foreach (T obj in dictionary[arg])
            if (list.Contains(obj))
                tempList.Add(obj);

    list = new List<T>(tempList);
}

The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.

That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every search word, rather than first piling in everything that matches any single word and then filtering it back out at the end.
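For completeness, the seeding step might look something like this (a sketch with assumed types, wrapped into a single method):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SearchSetup
{
    // Sketch only: seed the candidate list from the first non-empty word,
    // then intersect the remaining words as in the loop above.
    public static List<T> Search<T>(Dictionary<string, List<T>> dictionary,
                                    string[] allWords)
    {
        var words = allWords.Where(w => !String.IsNullOrEmpty(w)).ToList();
        if (words.Count == 0)
            return new List<T>();

        // Seed: all matches for the first word.
        List<T> seed;
        List<T> list = dictionary.TryGetValue(words[0], out seed)
            ? new List<T>(seed)
            : new List<T>();

        // Intersect in each remaining word.
        foreach (string arg in words.Skip(1))
        {
            var tempList = new List<T>();
            List<T> hits;
            if (dictionary.TryGetValue(arg, out hits))
                foreach (T obj in hits)
                    if (list.Contains(obj))
                        tempList.Add(obj);
            list = tempList;
        }

        return list;
    }
}
```

Seeding from the rarest word first would shrink the working list even faster, if term frequency can be estimated cheaply.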

Ben Aaronson
  • 6,955
  • 2
  • 23
  • 38
  • I did a synthetic test and your improvement ended up making it ~ twice as fast when searching for two words. I guess the speed gain increases further the more search terms you use. Thanks. – ASA Mar 15 '14 at 17:47
  • @Traubenfuchs No problem. There's probably scope for further improvements with the kind of fiddly optimisations I'm not very good at, like avoiding doing that list copying. – Ben Aaronson Mar 15 '14 at 17:51
  • You don't actually need to copy the list. You can just go templist = list; Those aren't pointers. When both templist and list reference the same object and you call templist = new List(), then list doesn't get changed. It's so very fast already, beyond academic reasons I am honestly not really interested in further optimizing it. – ASA Mar 15 '14 at 17:53
  • @Traubenfuchs Ah, of course! I initially had the list management done slightly differently so I was getting myself confused. Another potential optimisation might be something like wrapping T in a class that contains a hashset of every search word, already separated. Then inside the loop from my answer, don't look in the dictionary at all and just check if the hashset contains arg for each word already in list. – Ben Aaronson Mar 15 '14 at 17:58
0

In case anyone cares, here is my solution:
Disclaimer:
This is only a rough draft.
I have only done some synthetic testing, and I have written a lot of it without re-testing it.
I have revised my code: inserts now generate ((n^2)/2)+(n/2) substrings per word instead of 2^n-1, which is vastly faster. Word length is now practically irrelevant.

namespace MegaHash
{
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class GenericConcurrentMegaHash<T>
{
    // After doing a bulk add, call AwaitAll() to ensure all data was added!
    private ConcurrentBag<Task> bag = new ConcurrentBag<Task>();

    private ConcurrentDictionary<string, List<T>> dictionary = new ConcurrentDictionary<string, List<T>>();

    // consider changing this to include for example '-'
    public char[] splitChars;

    public GenericConcurrentMegaHash()
        : this(new char[] { ' ' })
    {
    }

    public GenericConcurrentMegaHash(char[] splitChars)
    {
        this.splitChars = splitChars;
    }

    public void Add(string keyWords, T o)
    {
        keyWords = keyWords.ToUpper();

        foreach (string keyWord in keyWords.Split(splitChars))
        {
            // Skip empty tokens; a 'return' here would silently drop
            // all remaining keywords.
            if (string.IsNullOrEmpty(keyWord))
                continue;

            this.bag.Add(Task.Factory.StartNew(() => AddInternal(keyWord, o)));
        }
    }

    public void AwaitAll()
    {
        lock (this.bag)
        {
            foreach (Task t in bag)
                t.Wait();

            this.bag = new ConcurrentBag<Task>();
        }
    }

    private void AddInternal(string key, T y)
    {
        // Enumerate every substring of key exactly once:
        // i fixes the substring length (key.Length - i), i2 the start index.
        for (int i = 0; i < key.Length; i++)
        {
            for (int i2 = 0; i2 < i + 1; i2++)
            {
                string desire = key.Substring(i2, key.Length - i);

                // GetOrAdd avoids the race where two tasks create a list
                // for the same substring and one of them gets overwritten.
                List<T> l = dictionary.GetOrAdd(desire, _ => new List<T>());
                lock (l)
                {
                    if (!l.Contains(y))
                        l.Add(y);
                }
            }
        }
    }

    public IList<T> FulltextSearch(string searchString)
    {
        searchString = searchString.ToUpper();

        string[] searchWords = searchString.Split(splitChars);

        // Gather every object that matches at least one search word.
        List<T> list = new List<T>();

        foreach (string arg in searchWords)
        {
            if (string.IsNullOrEmpty(arg))
                continue;

            List<T> hits;
            if (dictionary.TryGetValue(arg, out hits))
                foreach (T obj in hits)
                    if (!list.Contains(obj))
                        list.Add(obj);
        }

        // Keep only the objects that match every search word (AND semantics).
        // TryGetValue also avoids a KeyNotFoundException for unknown words.
        List<T> returnList = new List<T>();

        foreach (T o in list)
        {
            bool matchesAll = true;

            foreach (string arg in searchWords)
            {
                if (string.IsNullOrEmpty(arg))
                    continue;

                List<T> hits;
                if (!dictionary.TryGetValue(arg, out hits) || !hits.Contains(o))
                {
                    matchesAll = false;
                    break;
                }
            }

            if (matchesAll)
                returnList.Add(o);
        }

        return returnList;
    }
}

}

ASA
  • 1,911
  • 3
  • 20
  • 37