5

I am trying to improve performance on the following (sample) code.

Object[] inputKeys = new Object[10];
inputKeys[0] = "4021";
inputKeys[1] = "3011";
inputKeys[2] = "1010";
inputKeys[3] = "1020";
inputKeys[4] = "1030";

Then the input keys are compared:

for (int i = 0; i < 5; i++)
{
    for (int j = 0; j < 5; j++)
    {
        bool result = inputKeys[i].Equals(inputKeys[j]);
    }
}

The inputKeys can all be of type string, Int32, or DateTime.

There is a huge performance drop on the .Equals line when it is hit millions of times.

Any suggestions on how to improve the performance of this line (the equality check)?

I have tried this: using an array of the class below, instead of the Object array, to hold the keys. In it I keep the key type and the key value.

public class CustomKey : IEquatable<CustomKey>
{
    internal int KeyType { get; private set; }

    internal string ValueString { get; private set; }
    internal int ValueInteger { get; private set; }
    internal DateTime ValueDateTime { get; private set; }

    internal CustomKey(string keyValue)
    {
        this.KeyType = 0;
        this.ValueString = keyValue;
    }

    internal CustomKey(int keyValue)
    {
        this.KeyType = 1;
        this.ValueInteger = keyValue;
    }

    internal CustomKey(DateTime keyValue)
    {
        this.KeyType = 2;
        this.ValueDateTime = keyValue;
    }

    public bool Equals(CustomKey other)
    {
        // Keys of a different type (or a null key) can never be equal.
        if (other == null || this.KeyType != other.KeyType)
        {
            return false;
        }

        // Compare only the field that matches the stored key type.
        switch (this.KeyType)
        {
            case 0:
                return this.ValueString.Equals(other.ValueString);
            case 1:
                return this.ValueInteger.Equals(other.ValueInteger);
            case 2:
                return this.ValueDateTime.Equals(other.ValueDateTime);
            default:
                return false;
        }
    }
}

But the performance was worse.

Gayan Dasanayake
  • Your problem is the algorithm itself. You are comparing every item with every other item, which requires quadratic time. If you need to compare millions of items, you should find a better way to do so. One option -- not necessarily the best -- would be to divide the data by type and then sort it; this would make comparison trivial, and take only n log n time. – Thom Smith Dec 18 '12 at 17:58
  • How many distinct values do you expect there to be? If you're expecting, say, millions of items but only tens of thousands of values, then a simple hash table might do the trick. – Thom Smith Dec 18 '12 at 17:58
  • Impossible to answer. The best approach is to hit Equals less often. It definitely does not get slower when it runs more often - a single Equals call takes the same time each time. The fundamental usage seems to be badly chosen (for example: check hash codes first, order lists to make fewer Equals calls, etc.). These are "well known" technologies with 50+ years of history (indices, databases). In the end, the problem is not the time an Equals call takes, but that you call it millions of times - an inefficient algorithm. – TomTom Dec 18 '12 at 17:59
  • @TomTom is right. Don't waste your time rewriting Equals. The .NET Equals already handles comparison of differing types. Your version just does that work an additional time. Concentrate instead on the code surrounding the Equals. – hatchet - done with SOverflow Dec 18 '12 at 18:02

5 Answers

2

Your comparison loop is inefficient. I suggest you try:

Distinct<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)

Define your IEqualityComparer for that type and pass it to that method. You won't get a bool, but you will get an IEnumerable containing the elements without duplicates.
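
For illustration, here is a minimal sketch of that approach; the comparer class, the demo class and the sample data are my own assumptions, not part of the answer.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer for the boxed keys; it simply delegates to the
// framework's own equality and hashing, so mixed string/int/DateTime keys work.
class MixedKeyComparer : EqualityComparer<object>
{
    public override bool Equals(object x, object y)
    {
        return object.Equals(x, y);
    }

    public override int GetHashCode(object obj)
    {
        return obj == null ? 0 : obj.GetHashCode();
    }
}

class DistinctDemo
{
    static void Main()
    {
        object[] inputKeys = { "4021", "3011", "1010", "1020", "4021" };

        // Distinct hashes each key once, so duplicates are found in roughly
        // O(n) time instead of comparing every pair.
        IEnumerable<object> uniqueKeys = inputKeys.Distinct(new MixedKeyComparer());

        foreach (object key in uniqueKeys)
            Console.WriteLine(key);
    }
}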

e_ne
  • Thanks, but when I run a profiler (ANTS) the drop is only on the equality check and negligible on the loop. That's why I said I wish only to improve the equality. – Gayan Dasanayake Dec 18 '12 at 18:02
  • I agree it's not the Equals operation, it's the way he's doing the looping. – Justin Dec 18 '12 at 18:02
  • Combine the use of hash codes with an algorithm better than quadratic and it should perform way better. Also suggested reading: http://stackoverflow.com/questions/11515991/best-algorithm-for-removing-duplicate-values-from-a-list – e_ne Dec 18 '12 at 18:06
2

As an example of algorithmic efficiency, your first snippet could be rewritten as

for (int i = 0; i < 5; i++)
{
    for (int j = i; j < 5; j++)
    {
        bool result = inputKeys[i].Equals(inputKeys[j]);
    }
}

because x.Equals(y) will give the same result as y.Equals(x), you don't need to check both ways. http://msdn.microsoft.com/en-us/library/ms173147(v=vs.80).aspx

Any new implementation of Equals should uphold all of the standard guarantees, including:

x.Equals(y) returns the same value as y.Equals(x).

1

As said in the comments, the main burden of your algorithm is that you have to compare everything with everything, which is killing your performance. For 100K elements that means 100K^2, or about 10,000 million combinations... you can see where you have a problem. The best option is to revise the algorithm; however, if you're still determined or you don't have any other option, consider the following.

Divide your objects FIRST, compare later:

Example: If you have 100K objects equally distributed, you'll have 33K ints, 33K strings and 33K DateTimes, which you can then compare with each other within each group and ignore the combinations between the types.

100K^2 = 10,000 million combinations

(30K^2) * 3 = 2,700 million combinations + 100K operations to place each element in its list
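
As a rough sketch of that partitioning step (assuming the keys are still boxed in an object[]; the class name and sample data are illustrative):

using System;
using System.Linq;

class GroupByTypeDemo
{
    static void Main()
    {
        object[] inputKeys = { "4021", 3011, new DateTime(2012, 12, 18), "1010", 3011 };

        // Partition the keys by their runtime type first...
        var groups = inputKeys.GroupBy(key => key.GetType());

        // ...then compare only within each group; keys of different types
        // can never be equal, so those pairs are skipped entirely.
        foreach (var group in groups)
        {
            object[] keys = group.ToArray();
            for (int i = 0; i < keys.Length; i++)
            {
                for (int j = i + 1; j < keys.Length; j++)
                {
                    bool result = keys[i].Equals(keys[j]);
                }
            }
        }
    }
}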

Extend your groups

If you don't care too much about memory, you can hash the results to further refine your groups. Basically, construct a grid... this is very specific to your problem.

The idea behind this is to isolate things that can't really be equal. It's an extension of the previous idea, but with more groups: the smaller the groups, the better your performance.

That way you can have 10 groups, for example:

  • Strings shorter than 5 characters
  • Strings between 5 and 50 characters
  • Strings longer than 50 characters

and so on...

If you redo the math (again, for an evenly distributed sample)

Total iterations = 10K^2 * 10 + 100K ≈ 1,000 million iterations (10 groups plus the price of composing those groups)

Actual complexity = (n/m)^2 * m + n = n^2/m + n (where n = number of elements and m = number of groups, assuming an even distribution).
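
One possible way to express that finer grouping in code (a sketch under the assumption that only string keys get the extra length bands; names and sample data are illustrative):

using System;
using System.Linq;

class FineGroupDemo
{
    static void Main()
    {
        object[] inputKeys = { "4021", "a key somewhere between five and fifty chars", 3011, new DateTime(2012, 12, 18) };

        // Group by runtime type, and split strings further by length band, so
        // keys that cannot possibly be equal never end up in the same group.
        var groups = inputKeys.GroupBy(key =>
        {
            string s = key as string;
            if (s == null)
                return key.GetType().Name;
            if (s.Length < 5)
                return "string: shorter than 5";
            if (s.Length <= 50)
                return "string: 5 to 50";
            return "string: longer than 50";
        });

        foreach (var group in groups)
            Console.WriteLine("{0}: {1} keys", group.Key, group.Count());
    }
}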

Jorge Córdoba
0

Try grabbing the hash code of every object via object.GetHashCode() and comparing those. Not sure about the overhead of calling GetHashCode() a few million times, but comparing two ints is probably going to be a lot quicker than the Equals(object) method.
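
A sketch of that idea with the hash codes cached up front; note that equal hash codes do not guarantee equal objects, so Equals still has to confirm a match (the demo class and sample data are my own illustration):

using System;

class HashPrefilterDemo
{
    static void Main()
    {
        object[] inputKeys = { "4021", "3011", "1010", "1020", "4021" };

        // Compute each hash code once instead of on every comparison.
        int[] hashes = new int[inputKeys.Length];
        for (int i = 0; i < inputKeys.Length; i++)
            hashes[i] = inputKeys[i].GetHashCode();

        for (int i = 0; i < inputKeys.Length; i++)
        {
            for (int j = i + 1; j < inputKeys.Length; j++)
            {
                // Cheap int comparison first; Equals runs only when the hash
                // codes match (hashes can collide, so Equals makes the final call).
                bool result = hashes[i] == hashes[j] && inputKeys[i].Equals(inputKeys[j]);
            }
        }
    }
}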

Robert Rouhani
0

Use a hashtable (or better, a Dictionary) to store your items. Your approach is O(N^2); by using a hashtable you can reduce the running time to roughly O(N), where N is the number of items.

To accomplish this, create a hashtable keyed on the hash code; if you get a collision, add the items to a linked list. Then you only need to check objects in the same bucket for equality, which should not be many.
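
As an illustration of that bucketing idea, here is a rough sketch; the Dictionary of lists stands in for the hashtable-with-chaining described above, and the names and sample data are mine:

using System;
using System.Collections.Generic;

class BucketDemo
{
    static void Main()
    {
        object[] inputKeys = { "4021", "3011", "1010", "4021", "1020" };

        // Bucket the keys by hash code; only keys in the same bucket can possibly be equal.
        var buckets = new Dictionary<int, List<object>>();

        foreach (object key in inputKeys)
        {
            int hash = key.GetHashCode();
            List<object> bucket;
            if (!buckets.TryGetValue(hash, out bucket))
            {
                bucket = new List<object>();
                buckets.Add(hash, bucket);
            }

            // Equals is only called against the few keys sharing this hash code.
            if (bucket.Contains(key))
                Console.WriteLine("duplicate key: " + key);

            bucket.Add(key);
        }
    }
}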

I hope this is clear and helpful.

Khalefa