10

I have a text file with 100,000 pairs: word and frequency.

The test.in file is laid out as follows:

  • line 1 - total count of word-frequency pairs
  • lines 2 to ~100,001 - word-frequency pairs
  • line 100,002 - total count of user input words
  • lines 100,003 to the end - user input words

I parse this file and put the word-frequency pairs into a

Dictionary<string,double> dictionary;
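
Roughly, the parsing looks like this (a simplified sketch; it assumes each pair line is "word frequency" separated by a single space):

// Simplified parsing sketch for test.in (assumes "word frequency" pairs separated by a space).
dictionary = new Dictionary<string, double>();
using (var reader = new StreamReader("test.in"))
{
    int pairCount = int.Parse(reader.ReadLine());        // line 1: number of pairs
    for (int i = 0; i < pairCount; i++)
    {
        var parts = reader.ReadLine().Split(' ');        // lines 2 .. pairCount + 1
        dictionary[parts[0]] = double.Parse(parts[1]);
    }
    // the remaining lines hold the count of user input words and the words themselves
}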

Then I want to run some search-and-order logic for each user input word, as in the following code:

for (int i = 0; i < 15000; i++)
{
    tempInputWord = // take data from the file (or another source)

    var adviceWords = dictionary
                .Where(p => p.Key.StartsWith(tempInputWord, StringComparison.Ordinal))
                .OrderByDescending(ks => ks.Value)
                .ThenBy(ks => ks.Key, StringComparer.Ordinal)
                .Take(10)
                .ToList();

    // some output
}

The problem: This code must run in less than 10 seconds.

On my computer (Core i5 2400, 8 GB RAM), even with Parallel.For(), this takes about 91 seconds.

Can you give me some advice on how to improve performance?

UPDATE:

Hooray! We did it! Thank you @CodesInChaos, @usr, @T_D and everyone who was involved in solving the problem.

The final code:

// Flatten the dictionary into a list sorted by key (ordinal), so prefix matches are contiguous.
var kvList = dictionary.OrderBy(ks => ks.Key, StringComparer.Ordinal).ToList();

var strComparer = new MyStringComparer();   // compares entries by key only (ordinal)
var intComparer = new MyIntComparer();      // orders by frequency descending, then by key
var kvListSize = kvList.Count;

// Read all search words first, so the queries themselves can run in parallel.
// userWordQuantity comes from line 100,002 of test.in.
var allUserWords = new List<string>();
for (int i = 0; i < userWordQuantity; i++)
{
    var searchWord = Console.ReadLine();
    allUserWords.Add(searchWord);
}

var result = allUserWords
    .AsParallel()
    .AsOrdered()            // keep the output in input order
    .Select(searchWord =>
    {
        // Binary search for the first entry >= searchWord; ~startIndex is the insertion point.
        int startIndex = kvList.BinarySearch(new KeyValuePair<string, int>(searchWord, 0), strComparer);
        if (startIndex < 0)
            startIndex = ~startIndex;

        // Collect the contiguous block of entries whose keys start with searchWord.
        var matches = new List<KeyValuePair<string, int>>();
        for (int j = startIndex; j < kvListSize; j++)
        {
            if (kvList[j].Key.StartsWith(searchWord, StringComparison.Ordinal))
                matches.Add(kvList[j]);
            else
                break;
        }

        matches.Sort(intComparer);

        return matches.Select(s => s.Key).Take(10).ToList();
    });

foreach (var adviceWords in result)
{
    foreach (var adviceWord in adviceWords)
    {
        Console.WriteLine(adviceWord);
    }
    Console.WriteLine();
}

6 sec (9 sec with LINQ instead of the manual matching loop)
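
For completeness, the two custom comparers are simple IComparer<KeyValuePair<string, int>> implementations along these lines (a sketch, not the exact code):

class MyStringComparer : IComparer<KeyValuePair<string, int>>
{
    // Compare by key only, ordinally - used for the binary search on the key-sorted list.
    public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        return string.CompareOrdinal(x.Key, y.Key);
    }
}

class MyIntComparer : IComparer<KeyValuePair<string, int>>
{
    // Frequency descending, then key ascending (ordinal) - the required output order.
    public int Compare(KeyValuePair<string, int> x, KeyValuePair<string, int> y)
    {
        int byValue = y.Value.CompareTo(x.Value);
        return byValue != 0 ? byValue : string.CompareOrdinal(x.Key, y.Key);
    }
}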

chromigo
  • Consider that by using `Where` as your first operation you are already resigning yourself to an O(n) algorithm at best. Not to mention that `Console.ReadLine()` is not cheap. – Dan Oct 01 '14 at 09:27
  • 7
    How do you do Parallel "Console.Readline" 15000 times in 37 seconds? I sure can't type that fast. – Ira Baxter Oct 01 '14 at 09:27
  • 1
    What about a compiled regular expression? Then just iterate the matches. – Adam Houldsworth Oct 01 '14 at 09:28
  • 1
    I can't see what that for (parallel or not) is for. You just want to repeat same search many times? – Adriano Repetti Oct 01 '14 at 09:30
  • 2
    ... assuming you are really reading from a file instead of getting console input, you may be serializing on access to the file. – Ira Baxter Oct 01 '14 at 09:31
  • @AdrianoRepetti: He appears to be doing a different search on different input for each iteration – Ira Baxter Oct 01 '14 at 09:32
  • I don't think reading from console in parallel is a good idea. Do that in a separate step. – Pieter21 Oct 01 '14 at 09:32
  • 2
    @chromigo A complete front-to-back example will help a lot here... – Adam Houldsworth Oct 01 '14 at 09:33
  • @IraBaxter yes but then performance comparison is meaningless here and code should be trimmed to loop body only... – Adriano Repetti Oct 01 '14 at 09:34
  • @IraBaxter I run this code from a unit test with this trick: `using (StreamReader sr = new StreamReader("test.in")) { ... Console.SetIn(sr); ... }` – chromigo Oct 01 '14 at 10:18
  • 1
    The title of the question isn't very specific btw. Could be improved ;) – T_D Oct 01 '14 at 10:56
  • @chromigo Please avoid using tricks which make your code surprising. Someone else might have to maintain it someday, or you yourself might forget which tricks you used where. And it sure sidetracks the SO commenters. EDIT: Ah, it's for a unit test. So the loop is part of the unit test? Then maybe you just need to take that part out of the question body, since it's not part of the algorithm you want to optimize. – Keen Oct 01 '14 at 14:00
  • 1
    @Cory Unfortunately, no. The task says (translated from another language): "Input data is supplied to the program as standard input (keyboard input). The program should output the response to the standard output (display). • The solution must work quickly (no more than 1-10 seconds) on a test file test.in. ..." Maybe it's silly, but that's how it is. That's why I am using a unit test with a fake source (I guess the task will be checked the same way). And the loop is part of the logic (Console.ReadLine() takes the user input (or the test file)). – chromigo Oct 01 '14 at 17:33
  • 2
    If you provide the file with the value pairs and the list of searchWords that are used for testing, we can test our ideas – T_D Oct 01 '14 at 19:46
  • @T_D I added this file in original post with short description(update 3, end) – chromigo Oct 02 '14 at 16:11
  • 3
    This question appears to be off-topic because it is asking for a code review. See [What topics can I ask about here](http://stackoverflow.com/help/on-topic) in the Help Center. Perhaps [Code Review Stack Exchange](http://codereview.stackexchange.com/) would be a better place to ask. – jww Oct 03 '14 at 04:56
  • Trie seems to be an easy and efficient way to start. Did you try to profile your code to find where the bottlenecks are with different implementations? It can be really helpful; sometimes it gives real surprises about where the flaws are, especially for this kind of computation, which may rely on a bit of magic (complex collection types, LINQ, behind-the-scenes CLR behavior, etc.). – AFract Oct 03 '14 at 14:34

5 Answers

9

You are not using any algorithmic strength of the dictionary at all. Ideally, you'd use a tree structure so that you can perform prefix lookups. On the other hand, you are within 3.7x of your performance goal. I think you can reach that by just optimizing the constant factor in your algorithm.

  1. Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List<T>. That turns out to give a major speed-up in practice.
  2. Don't use a dictionary at all. Just use a KeyValuePair<T1, T2>[] and run through it using a foreach loop. This is the fastest possible way to traverse a set of pairs.

Could look like this:

KeyValuePair<string, int>[] items; // all word-frequency pairs, loaded once
var matches = new List<KeyValuePair<string, int>>(); //Consider pre-sizing this.

//This could be a parallel loop as well.
//Make sure to not synchronize too much on matches.
//If there tend to be few matches a lock will be fine.
foreach (var item in items)
{
    if (item.Key.StartsWith(tempInputWord, StringComparison.Ordinal))
    {
        matches.Add(item);
    }
}

//Sort in-place: frequency descending, then key ascending
matches.Sort((a, b) => b.Value != a.Value
    ? b.Value.CompareTo(a.Value)
    : string.CompareOrdinal(a.Key, b.Key));

return matches.Take(10); //Maybe matches.RemoveRange(10, matches.Count - 10) is better

That should exceed a 3.7x speedup.

If you need more, try stuffing the items into a dictionary keyed on the first char of Key. That way you can look up all items matching tempInputWord[0]. That should reduce search times by the selectivity that is in the first char of tempInputWord. For English text that would be on the order of 26 or 52. This is a primitive form of prefix lookup that has one level of lookup. Not pretty but maybe it is enough.
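
A rough sketch of that one-level bucketing (using the items array and tempInputWord from above; names are illustrative):

// Build a one-level "prefix index" keyed on the first character of each word.
var buckets = new Dictionary<char, List<KeyValuePair<string, int>>>();
foreach (var item in items)
{
    List<KeyValuePair<string, int>> bucket;
    if (!buckets.TryGetValue(item.Key[0], out bucket))
        buckets[item.Key[0]] = bucket = new List<KeyValuePair<string, int>>();
    bucket.Add(item);
}

// At query time only the matching bucket has to be scanned.
List<KeyValuePair<string, int>> candidates;
if (!buckets.TryGetValue(tempInputWord[0], out candidates))
    candidates = new List<KeyValuePair<string, int>>(); // no word starts with that character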

usr
  • 1
    It is even easier. He is RESORTING the keys in the dictionary. On every run. Twice (order by then by). No idea why he even considered choosing a dictionary - totally unsuitable selection. – TomTom Oct 01 '14 at 09:38
  • 1
    He's only sorting the matches. Hopefully, there are only a few. – usr Oct 01 '14 at 09:40
  • 1
    Don't use LINQ in perf-critical code. Manually loop over all collections and collect results into a List. That turns out to give a major speed-up in practice. -- do you have anything to back this up? That sounds very dubious. – Bryan Boettcher Oct 01 '14 at 16:27
  • 1
    @insta LINQ calls through a delegate/vcall three times for each item. Modern CPUs like to read ahead in the instruction stream. They can't do that with indirect calls. LINQ is very slow if the work items are tiny. Try `Enumerable.Range(0, 1000000).Sum()` vs. a loop. Probably 3-10x difference. This is rather well-known. – usr Oct 02 '14 at 10:17
4

I think the best way would be to use a Trie data structure instead of a dictionary. A trie stores all the words in a tree structure, where a node represents all the words that start with the same letters. So if you look up your search word tempInputWord in a trie, you get the node that represents all the words starting with tempInputWord, and you only have to traverse its child nodes; you need just one search operation. The Wikipedia article on tries also mentions some other advantages over hash tables (which is what a Dictionary basically is):

  • Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
  • There are no collisions of different keys in a trie.
  • Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

And here are some ideas for creating a trie in C#.

This should at least speed up the lookup; building the trie, however, might be slower than building the dictionary.
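
Just for illustration, a minimal trie along these lines could look like the sketch below. This is simplified and is not the Trie<T> class used in the update that follows; it assumes using System.Collections.Generic and System.Linq.

// Minimal illustrative trie that maps words to word-frequency pairs.
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<KeyValuePair<string, int>> Values = new List<KeyValuePair<string, int>>();

    // Walk/create one node per character and store the value at the node for the full word.
    public void Add(string key, KeyValuePair<string, int> value, int depth = 0)
    {
        if (depth == key.Length) { Values.Add(value); return; }
        TrieNode child;
        if (!Children.TryGetValue(key[depth], out child))
            Children[key[depth]] = child = new TrieNode();
        child.Add(key, value, depth + 1);
    }

    // Descend to the node for the prefix, then collect every value stored at or below it.
    public IEnumerable<KeyValuePair<string, int>> Retrieve(string prefix, int depth = 0)
    {
        if (depth < prefix.Length)
        {
            TrieNode child;
            return Children.TryGetValue(prefix[depth], out child)
                ? child.Retrieve(prefix, depth + 1)
                : Enumerable.Empty<KeyValuePair<string, int>>();
        }
        return Values.Concat(Children.Values.SelectMany(c => c.Retrieve(prefix, depth)));
    }
}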

Update: OK, I tested it myself using a file of English word frequencies in the same format as yours. This is my code, which uses the Trie class that you also tried to use.

    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();

        sw.Start();
        var trie = new Trie<KeyValuePair<string,int>>();

        //build trie with your value pairs
        var lines = File.ReadLines("en.txt");
        foreach(var line in lines.Take(100000))
        {
            var split = line.Split(' ');
            trie.Add(split[0], new KeyValuePair<string,int>(split[0], int.Parse(split[1])));
        }

        Console.WriteLine("Time needed to read file and build Trie with 100000 words: " + sw.Elapsed);

        sw.Reset();

        //test with 10000 search words
        sw.Start();
        foreach (string line in lines.Take(10000))
        {
            var searchWord = line.Split(' ')[0];
            var allPairs = trie.Retrieve(searchWord);
            var bestWords = allPairs.OrderByDescending(kv => kv.Value).ThenBy(kv => kv.Key).Select(kv => kv.Key).Take(10);

            var output = bestWords.Aggregate("", (s1, s2) => s1 + ", " + s2);
            Console.WriteLine(output);

        }

        Console.WriteLine("Time to process 10000 different searchWords: " + sw.Elapsed);
    }

My results on a pretty similar machine:

Time needed to read file and build Trie with 100000 words: 00:00:00.7397839
Time to process 10000 different searchWords: 00:00:03.0181700

So I think you are doing something wrong that we cannot see, for example in the way you measure the time or the way you read the file. As my results show, this stuff should be really fast. The 3 seconds are mainly due to the Console output in the loop, which I needed so that the bestWords query is actually consumed; otherwise the lazy LINQ query would never be evaluated.
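
If you want to exclude the console cost but still force the query to execute, you could consume bestWords in some cheaper way, for example (illustrative fragment):

long totalLength = 0;   // declared once, outside the search loop

// ... inside the loop, instead of Console.WriteLine(output):
totalLength += bestWords.Sum(w => w.Length);   // forces the lazy query to execute

// ... after the loop:
Console.WriteLine(totalLength);   // one write, so the work cannot be skipped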

T_D
  • Nice solution, I will try to implement it and write about the result later. – chromigo Oct 01 '14 at 10:42
  • I tried this variant, but failed. Maybe I did something wrong. I updated my post (update 2, details there) . – chromigo Oct 01 '14 at 19:01
  • I need some clarification: Which operations do you measure the time of? Is it just the lookup of the best 10 words or do you also include the creation of the Dictionary/Trie? As I said the creation of the Trie should be slower than the creation of the dictionary. With a Trie you shouldn't need the Dictionary anymore but I still see it in your code of Update 2. – T_D Oct 01 '14 at 19:10
  • I updated my post with an example implementation. It works pretty fast for me. Hope it helps. – T_D Oct 01 '14 at 20:56
  • I added the test.in file (and the project with the Trie etc.). With this dataset I got a slower result. – chromigo Oct 02 '14 at 23:45
  • 1
    Ok, I guess I found the bottleneck: the allPairs variable can be huge and the Ordering of these huge collections is the problem. I think you need to use an optimized algorithm here. So have a look here: http://stackoverflow.com/questions/4956593/optimal-algorithm-for-returning-top-k-values-from-an-array-of-length-n – T_D Oct 03 '14 at 09:42
2
  • Replace the dictionary by a List<KeyValuePair<string, decimal>>, sorted by the key.

    For the search I use the fact that, with ordinal comparison, a string sorts directly before the strings that start with it. So I can use a binary search to find the first candidate. Since the candidates are contiguous, I can replace Where with TakeWhile.

    // 'comparer' compares entries by key only, using ordinal comparison.
    int startIndex = dictionary.BinarySearch(new KeyValuePair<string, decimal>(searchWord, 0), comparer);
    if (startIndex < 0)
        startIndex = ~startIndex;
    
    var adviceWords = dictionary
                .Skip(startIndex)
                .TakeWhile(p => p.Key.StartsWith(searchWord, StringComparison.Ordinal))
                .OrderByDescending(ks => ks.Value)
                .ThenBy(ks => ks.Key, StringComparer.Ordinal)
                .Select(s => s.Key)
                .Take(10).ToList();
    
  • Make sure to use ordinal comparison for all operations, including the initial sort, the binary search and the StartsWith check.

  • I would call Console.ReadLine outside the parallel loop. Probably using AsParallel().Select(...) on the collection of search words instead of Parallel.For.
CodesInChaos
  • Good advice, but maybe you skipped `.Skip(startIndex)` before `.TakeWhile(...)` – chromigo Oct 01 '14 at 16:17
  • I tried your solution and wrote the result at the top of the post – chromigo Oct 01 '14 at 17:15
  • 1
    @chromigo 1) Yeah, forgot to include `Skip`. 2) You misunderstood my suggestion concerning `AsParallel`. I recommend first reading all search words into a collection. Then call `AsParallel().Select(individualSearch)` or parallel foreach on it. Don't try to parallelize an individual query. – CodesInChaos Oct 02 '14 at 08:46
  • In this case, if I understood correctly, I could get an incorrect order of elements in the output – chromigo Oct 03 '14 at 10:08
  • @chromigo In that case, add `AsOrdered`. – CodesInChaos Oct 03 '14 at 16:02
1

If you want to profile, separate out the reading of the file and see how long that takes on its own. Data calculation, collection, and presentation could likewise be separate steps.

If you want concurrency AND a dictionary, look at ConcurrentDictionary, maybe even more for reliability than for performance, but probably for both:

http://msdn.microsoft.com/en-us/library/dd287191(v=vs.110).aspx
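
For illustration, filling the map concurrently could look roughly like this (a sketch assuming the file layout from the question; needs the usual System.IO, System.Linq and System.Threading.Tasks usings):

using System.Collections.Concurrent;   // ConcurrentDictionary<TKey, TValue>

// Rough sketch: fill the word-frequency map from the pair lines in parallel.
var map = new ConcurrentDictionary<string, double>();
Parallel.ForEach(File.ReadLines("test.in").Skip(1).Take(100000), line =>
{
    var parts = line.Split(' ');
    map.TryAdd(parts[0], double.Parse(parts[1]));
});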

Pieter21
  • He's only doing reads of the dictionary in parallel. If it isn't implemented stupidly, that's likely to be safe, and he doesn't need to pay the overhead of concurrent dictionary internal locks. – Ira Baxter Oct 01 '14 at 09:43
0

Assuming the 10 is constant, why is everyone storing the entire data set? Memory is not free. The fastest solution is to store the first 10 entries in a list and sort it. Then maintain that 10-element sorted list as you traverse the rest of the data set, removing the 11th element every time you insert one (a sketch follows below).

The above method works best for small values. If you had to take the first 5000 objects, consider using a binary heap instead of a list.
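
A rough sketch of that running top-10 idea (illustrative; items and searchWord as in the question, ranked by frequency descending and then by key):

// Keep a running top-10 while scanning, instead of collecting and sorting all matches.
var byRank = Comparer<KeyValuePair<string, int>>.Create((a, b) =>
    b.Value != a.Value ? b.Value.CompareTo(a.Value) : string.CompareOrdinal(a.Key, b.Key));

var top = new List<KeyValuePair<string, int>>(11);
foreach (var item in items)
{
    if (!item.Key.StartsWith(searchWord, StringComparison.Ordinal)) continue;

    int pos = top.BinarySearch(item, byRank);
    if (pos < 0) pos = ~pos;              // insertion point that keeps the list sorted
    if (pos < 10)                         // only insert if it makes the current top 10
    {
        top.Insert(pos, item);
        if (top.Count > 10) top.RemoveAt(10);   // drop the 11th element
    }
}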

KevinZ