6

I have an array of custom objects named AnalysisResult. The array can contain hundreds of thousands of objects; and, occasionally I need only the Distinct() elements of that array. So, I wrote a item comparer class called AnalysisResultDistinctItemComparer and do my call like this:

public static AnalysisResult[] GetDistinct(AnalysisResult[] results)
{
    return results.Distinct(new AnalysisResultDistinctItemComparer()).ToArray();
}

My problem here is that this call can take a LONG time (on the order of minutes) when the array is particularly big (greater than 200,000 objects).

I currently call that method in a background worker and display a spinning gif to alert the user that the method is being performed and that the application has not frozen. This is all fine and well but it does not give the user any indication of the current progress.

I really need to be able to indicate to the user the current progress of this action; but, I have been unable to come up with a good approach. I was playing with doing something like this:

public static AnalysisResult[] GetDistinct(AnalysisResult[] results)
{
    var query = results.Distinct(new AnalysisResultDistinctItemComparer());

    List<AnalysisResult> retVal = new List<AnalysisResult>();
    foreach(AnalysisResult ar in query)
    {
        // Show progress here
        retVal.Add(ar);
    }

    return retVal.ToArray();
}

But the problem is that I have no way of knowing what my actual progress is. Thoughts? Suggestions?

Michael Mankus
  • 4,628
  • 9
  • 37
  • 63
  • The problem I think you're going to have is that you would need to know how many distinct values you have up front in order to set the max value of the progress indicator. You of course won't know this value until your query has actually run...You could always try and speed up this process by using some parallalism (not sure of your framework though) http://msdn.microsoft.com/en-us/library/dd383943.aspx – Damon Aug 15 '13 at 13:26

3 Answers3

4

Don't call ToArray() at the end of your method, just use yield return. So do this:

public static IEnumerable<AnalysisResult> Distinct(AnalysisResult[] results)
{
    var query = results.Distinct(new AnalysisResultDistinctItemComparer());

    foreach(AnalysisResult ar in query)
    {
        // Use yield return here, so that the iteration remains lazy.
        yield return ar;
    }
}

Basically, yield return does some compiler magic to ensure that the iteration remains lazy, so you don't have to wait for a complete new collection to be created before returning to the caller. Instead, as each item is calculated, you return that item immediately to the consumer (which can then perform the updating logic -- per-item if necessary). You can use the same technique in your GetDistinct method as well.

Jon Skeet has an implementation that looks like this (LINQ's Distinct() on a particular property):

public static IEnumerable<TSource> DistinctBy<TSource, TKey>
    (this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    HashSet<TKey> seenKeys = new HashSet<TKey>();
    foreach (TSource element in source)
    {
        if (seenKeys.Add(keySelector(element)))
        {
            yield return element;
        }
    }
}

Notice here that he uses a HashSet, which is built to disallow duplicates. Simply check to see if the item has already been added, and if not, then return it.

That all said, remember that this is an Algorithms-and-Data Structures type question. It would be much easier to do something like this:

Dictionary<Key, Value> distinctItems = new Dictionary<Key, Value>(); 

foreach (var item in nonDistinctSetOfItems) {
    if (distinctItems.ConatainsKey(item.KeyProperty) == false) {
        distinctItems.Add(item.KeyProperty, item);
    }
}

... = distinctItems.Values // This would contain only the distinct items.

That is, a Symbol Table/Dictionary is built for just this sort of problem - associating entries with unique keys. If you keep your data stored this way, it greatly simplifies the problem. Don't overlook the simple solution!

Community
  • 1
  • 1
sircodesalot
  • 11,231
  • 8
  • 50
  • 83
  • Thanks for the info on Jon's approach to getting Distinct objects. Mine is similar but his is better. :-) – Michael Mankus Aug 15 '13 at 13:32
  • I'm accepting this answer for three reasons. 1) You've provided three approaches to determining Distinct items. 2) While the first two approaches do give the Distinct items they do not aid in determining the "progress"; but, the third approach does of course enable me to know where we are in the process of pulling the Distinct items. 3) I had my head up somewhere dark and didn't even think of doing something as simple as the third approach. – Michael Mankus Aug 15 '13 at 13:38
  • Hey, much appreciated. But yeah, while new tools are cool, it's good to remember the classical techniques are classic for a reason! – sircodesalot Aug 15 '13 at 13:41
1

Given the design of that Distinct method, you're iterating over the entire collection every time you call Distinct. Have you considered writing a custom collection that adds to an index each time you add an object to the array?

Dave Swersky
  • 34,502
  • 9
  • 78
  • 118
  • Interesting approach. I would need to keep references to both the unmodified original array and the distinct array. Do you have any examples of this? – Michael Mankus Aug 15 '13 at 13:24
  • I don't have an example, but the concept would be to evaluate on add rather than on query. If your comparison is fixed, the index method could use a reference index to avoid duplication. The yield return method from @sircodesalot is more flexible, giving you the ability to determine the comparison with a lambda at design time. – Dave Swersky Aug 15 '13 at 13:33
0

On the other hand you may use ThreadPool and WaitHandle to run your "Distinct" and "DisplayProgress" business with multiple threads.

public class Sample
{
    public void Run()
    {
        var state = new State();
        ThreadPool.QueueUserWorkItem(DoWork, state);
        ThreadPool.QueueUserWorkItem(ShowProgress, state);
        WaitHandle.WaitAll(new WaitHandle[] {state.AutoResetEvent});
        Console.WriteLine("Completed");
    }

    public void DoWork(object state)
    {
        //do your work here
        for (int i = 0; i < 10; i++)
        {
            ((State) state).Status++;
            Thread.Sleep(1000);
        }

        ((State) state).AutoResetEvent.Set();
    }

    public void ShowProgress(object state)
    {
        var s = (State) state;
        while (!s.IsCompleted())
        {

            if (s.PrintedStatus != s.Status)
                Console.WriteLine(s.Status);
            s.PrintedStatus = s.Status;
        }
    }

    public class State
    {
        public State()
        {
            AutoResetEvent = new AutoResetEvent(false);
        }

        public AutoResetEvent AutoResetEvent { get; private set; }
        public int Status { get; set; }
        public int PrintedStatus { get; set; }
        private bool _completed;
        public bool IsCompleted()
        {
            return _completed;
        }
        public void Completed()
        {
            _completed = true;
            AutoResetEvent.Set();
        }
    }
}
mehmet mecek
  • 2,615
  • 2
  • 21
  • 25