
I implemented an algorithm using a dictionary with tuple keys, and while it works, it is very slow. I have a set of strings, and I am building an associative matrix where A["abc", "bcde"] = 2, the amount of overlap between the two strings. The tuples in L are the keys of A, and L is a sorted list, so A[L[i]] < A[L[i+1]]. In each step I merge the two strings with the biggest overlap in the set, then update the "matrix" and the list L, looping until the set has only one element. My problem is that with a dictionary the algorithm is too slow. Is there a more efficient way to do this? Here is my code:

List<string> words = new List<string>(wordsFromFile);

Dictionary<Tuple<string, string>, int> A = new Dictionary<Tuple<string, string>, int>();
List<Tuple<string, string>> L = new List<Tuple<string,string>>();

(I used counting sort to build L. After that, refreshing the matrix and the list is the time-consuming part:)

while (words.Count > 1)
{
    string LastItem1 = L.Last().Item1;
    string LastItem2 = L.Last().Item2;
    words.Remove(LastItem1);
    words.Remove(LastItem2);
    string newElement = merge(LastItem1, LastItem2);
    words.Add(newElement);
    for (int i = 0; i < words.Count; ++i)
    {
        if (words[i] == newElement)
        {
            A[new Tuple<string, string>(newElement, newElement)] = 0;
        }
        else
        {
            // carry the old rows/columns over to the merged string
            A[new Tuple<string, string>(newElement, words[i])] = A[new Tuple<string, string>(LastItem2, words[i])];
            A[new Tuple<string, string>(words[i], newElement)] = A[new Tuple<string, string>(words[i], LastItem1)];
        }
    }
    // drop every matrix entry that still mentions the two merged strings
    var itemsToRemove = A.Where(f => f.Key.Item1 == LastItem1 || f.Key.Item1 == LastItem2 || f.Key.Item2 == LastItem1 || f.Key.Item2 == LastItem2).ToArray();
    foreach (var item in itemsToRemove)
        A.Remove(item.Key);

    L.RemoveAt(L.Count - 1); // cheaper than L.Remove(L.Last()), which rescans the list
    for (int i = 0; i < L.Count; ++i)
    {
        if (L[i].Item1 == LastItem2 && L[i].Item2 != LastItem1 && L[i].Item2 != newElement && L[i].Item2 != LastItem2)
            L[i] = new Tuple<string, string>(newElement, L[i].Item2);
        else if (L[i].Item2 == LastItem1 && L[i].Item1 != LastItem1 && L[i].Item1 != newElement && L[i].Item1 != LastItem2)
            L[i] = new Tuple<string, string>(L[i].Item1, newElement);
    }

    var listItemsToRemove = L.Where(f => f.Item1 == LastItem1 || f.Item2 == LastItem2 || f.Item1 == LastItem2 || f.Item2 == LastItem1).ToArray();
    foreach (var item in listItemsToRemove)
        L.Remove(item);
}
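
(The merge function is not shown in the question; as a point of reference, a minimal sketch of what it presumably does — with hypothetical helper names, and assuming "overlap" means the longest suffix of the first string that is a prefix of the second, as in A["abc", "bcde"] = 2 — could look like this:)

```csharp
using System;

static class OverlapSketch
{
    // Length of the longest suffix of a that is also a prefix of b.
    public static int Overlap(string a, string b)
    {
        for (int len = Math.Min(a.Length, b.Length); len > 0; --len)
            if (string.CompareOrdinal(a, a.Length - len, b, 0, len) == 0)
                return len;
        return 0;
    }

    // Merge b onto a, dropping the overlapping prefix of b.
    public static string Merge(string a, string b)
    {
        return a + b.Substring(Overlap(a, b));
    }
}
```

With these definitions, Overlap("abc", "bcde") is 2 and Merge("abc", "bcde") is "abcde", matching the matrix example above.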
Miklos

1 Answer

  1. It's hard to read highly obfuscated code, but one thing that jumps out at me is this:

    L[i].Item1

    which is sub-optimal compared to a dictionary lookup. I imagine you want to retain ordering, in which case you can use SortedDictionary<,> (the built-in OrderedDictionary is non-generic).

  2. You use a for loop where a foreach loop would be better in your case. It is true that for loops are faster in raw performance, but not in the way you are using one. You do about 12 lookups on L, which is a list, not an array, and picking items out of the middle of a list like that is going to lose speed over time. Foreach is optimized for this specific case and is faster head to head when iterating a list (unless you introduce an int counter, in which case the for loop is faster).

  3. words[i] is looked up three times (inefficiently, compared to a foreach loop, which would look it up once).
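
As a sketch of points 2 and 3 — whatever the raw for-vs-foreach numbers turn out to be, fetching each element once into a local avoids the repeated indexed lookups (illustrative only; the method and variable names are made up):

```csharp
using System;
using System.Collections.Generic;

static class ForeachSketch
{
    // Counts pairs touching 'needle', fetching each tuple only once.
    public static int CountTouching(List<Tuple<string, string>> pairs, string needle)
    {
        int matches = 0;
        // The enumerator hands each tuple over exactly once, instead of
        // evaluating pairs[i].Item1 / pairs[i].Item2 several times per step.
        foreach (var pair in pairs)
        {
            string first = pair.Item1;   // cached locals, free to reuse
            string second = pair.Item2;
            if (first == needle || second == needle) ++matches;
        }
        return matches;
    }
}
```

The same caching works inside a for loop, of course: assign L[i] to a local once at the top of the body and use the local from then on.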

Dessus
  • `List` basically is an array; perhaps you were confusing it with `LinkedList`? Especially because you talk about "follow the linkedlist through memory". There is none here. – Ben Voigt May 02 '17 at 03:41
  • Your bullet #2 also seems to be discussing linked lists, even though it doesn't say so by name. – Ben Voigt May 02 '17 at 03:45
  • It's not; it's referring to loop performance. I will try to find a link to add to it. – Dessus May 02 '17 at 03:46
  • From research, it seems arrays are faster than lists in C# for index-based lookups, which means it is worth not repeating such a lookup many times; might as well cache that result (it will give some minor improvement). Foreach helps in achieving that, but at the cost of 2 extra runtime variables. If you are using those extra vars anyway, then it can be advantageous (sometimes). – Dessus May 02 '17 at 03:58
  • Generally a for loop is faster, but not always. In his case, words.Count is evaluated many times, which also counts against his for loop. See here: http://stackoverflow.com/questions/15204950/c-sharp-for-vs-foreach-huge-performance-difference – Dessus May 02 '17 at 03:58
  • That question is about `ElementAt` and polymorphic access via `IEnumerable`, neither of which are relevant here – Ben Voigt May 02 '17 at 04:27
  • It also mentions .Count which is what I referred to. – Dessus May 02 '17 at 04:28
  • Not the same Count (the one here is the `System.Collections.Generic.List.Count` property, directly dispatched and actually inlined; the one you linked is the `System.Linq.Enumerable.Count()` extension method, polymorphically dispatched). – Ben Voigt May 02 '17 at 04:30
  • [Here's an answer covering array, List, for, and foreach](http://stackoverflow.com/a/454923/103167) – Ben Voigt May 02 '17 at 04:38