7

What is the fastest way to union 2 sets of sorted values? Speed (big-O) is important here; not clarity - assume this is being done millions of times.

Assume you do not know the type or range of the values, but have an efficent IComparer<T> and/or IEqualityComparer<T>.

Given the following set of numbers:

var la = new int[] { 1, 2, 4, 5, 9 };
var ra = new int[] { 3, 4, 5, 6, 6, 7, 8 };

I am expecting 1, 2, 3, 4, 5, 6, 7, 8, 9. The following stub may be used to test the code:

static void Main(string[] args)
{
    var la = new int[] { 1, 2, 4, 5, 9 };
    var ra = new int[] { 3, 4, 5, 6, 6, 7, 8 };

    foreach (var item in UnionSorted(la, ra, Int32Comparer.Default))
    {
        Console.Write("{0}, ", item);
    }
    Console.ReadLine();
}

class Int32Comparer : IComparer<Int32>
{
    public static readonly Int32Comparer Default = new Int32Comparer();
    public int Compare(int x, int y)
    {
        if (x < y)
            return -1;
        else if (x > y)
            return 1;
        else
            return 0;
    }
}

static IEnumerable<T> UnionSorted<T>(IEnumerable<T> sortedLeft, IEnumerable<T> sortedRight, IComparer<T> comparer)
{
}
Cœur
  • 37,241
  • 25
  • 195
  • 267
Jonathan Dickinson
  • 9,050
  • 1
  • 37
  • 60
  • Related: http://stackoverflow.com/questions/7164572/c-fastest-intersection-of-2-sets-of-sorted-number/7164983#7164983 – Jonathan Dickinson Aug 23 '11 at 17:35
  • 1
    Do you know anything about the contents? For example, do you know that they are all integers in the range of 0 to 63? Do you know anything about the size of the sets? You can get big performance wins if you are willing to severely restrict the size and range of the contents. – Eric Lippert Aug 23 '11 at 17:38
  • @Downvoter - http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/ – Jonathan Dickinson Aug 23 '11 at 17:38
  • Your Compare can just return y-x –  Aug 23 '11 at 17:39
  • 2
    This is the "Merge" step of Merge sort. http://en.wikipedia.org/wiki/Merge_sort – BrokenGlass Aug 23 '11 at 17:39
  • @Eric - I asked the question because I implemented the wrong set operation for the related question :). I have already answered it. – Jonathan Dickinson Aug 23 '11 at 17:39
  • What makes this better then `HashSet.UnionWith()`? – Chad La Guardia Aug 23 '11 at 17:39
  • @Chad the list is already sorted - this is known ahead of time, HashSet has overhead because it assumes the values are not. – Jonathan Dickinson Aug 23 '11 at 17:41
  • 2
    @jdv-Jan de Vaan: **Do not do that; integer subtraction is not the same operation as integer comparison.** What if y-x overflows? See http://blogs.msdn.com/b/ericlippert/archive/2011/01/27/spot-the-defect-bad-comparisons-part-three.aspx for details, and you might want to read the rest of that series as well if you're the sort of person inclined to write bad comparison functions. – Eric Lippert Aug 23 '11 at 17:41
  • Before downvoting or commenting check the related question. – Jonathan Dickinson Aug 23 '11 at 17:41
  • @eric: true. I should have added "if the range is known to be limited to 2^31 -1". –  Aug 23 '11 at 17:50
  • Also it isn't restricted to integers - I purposefully used the word 'values', the test harness just uses 32 bit integers because it's easy. – Jonathan Dickinson Aug 23 '11 at 17:55
  • I think the "fastest possible algorithm" and the open-endedness of the values being pushed in are contradictory. If you really have no idea what might be sent in, we cannot tailor a custom method- there will always be values that can be entered that will make it slower than another possible solution. – Mike M. Aug 23 '11 at 18:22

4 Answers4

3

The following method returns the correct results:

static IEnumerable<T> UnionSorted<T>(IEnumerable<T> sortedLeft, IEnumerable<T> sortedRight, IComparer<T> comparer)
{
    var first = true;

    var continueLeft = true;
    var continueRight = true;

    T left = default(T);
    T right = default(T);

    using (var el = sortedLeft.GetEnumerator())
    using (var er = sortedRight.GetEnumerator())
    {
        // Loop until both enumeration are done.
        while (continueLeft | continueRight)
        {
            // Only if both enumerations have values.
            if (continueLeft & continueRight)
            {
                    // Seed the enumeration.
                    if (first)
                    {
                        continueLeft = el.MoveNext();
                        if (continueLeft)
                        {
                            left = el.Current;
                        }
                        else 
                        {
                            // left is empty, just dump the right enumerable
                            while (er.MoveNext())
                                yield return er.Current;
                            yield break;
                        }

                        continueRight = er.MoveNext();
                        if (continueRight)
                        {
                            right = er.Current;
                        }
                        else
                        {
                            // right is empty, just dump the left enumerable
                            if (continueLeft)
                            {
                                // there was a value when it was read earlier, let's return it before continuing
                                do
                                {
                                    yield return el.Current;
                                }
                                while (el.MoveNext());
                            } // if continueLeft is false, then both enumerable are empty here.
                            yield break;
                        }

                        first = false;
                    }

                // Compare them and decide which to return.
                var comp = comparer.Compare(left, right);
                if (comp < 0)
                {
                    yield return left;
                    // We only advance left until they match.
                    continueLeft = el.MoveNext();
                    if (continueLeft)
                        left = el.Current;
                }
                else if (comp > 0)
                {
                    yield return right;
                    continueRight = er.MoveNext();
                    if (continueRight)
                        right = er.Current;
                }
                else
                {
                    // The both match, so advance both.
                    yield return left;
                    continueLeft = el.MoveNext();
                    if (continueLeft)
                        left = el.Current;
                    continueRight = er.MoveNext();
                    if (continueRight)
                        right = er.Current;
                }
            }
            // One of the lists is done, don't advance it.
            else if (continueLeft)
            {
                yield return left;
                continueLeft = el.MoveNext();
                if (continueLeft)
                    left = el.Current;
            }
            else if (continueRight)
            {
                yield return right;
                continueRight = er.MoveNext();
                if (continueRight)
                    right = er.Current;
            }
        }
    }
}

The space is ~O(6) and time ~O(max(n,m)) (where m is the second set).

Pac0
  • 21,465
  • 8
  • 65
  • 74
Jonathan Dickinson
  • 9,050
  • 1
  • 37
  • 60
  • 2
    There are some bugs in this code. 1 it needs to yield return right after `else if (continueRight)`. Also I think the 'seed the enumeration' part should be after the while in case both collections are empty – mcintyre321 Aug 29 '14 at 11:04
  • In case both lists are empty, the current version returns a list with a zero element. – Pac0 Oct 01 '18 at 15:13
  • Actually, 0 will be added as soon as one of the lists is empty. – Pac0 Oct 01 '18 at 15:24
  • I made a fix for the "empty input set" edge cases. I put all the additional logic in the existing `if(first)` block, so it shouldn't impede in any way the good performance of this code by adding more branching at each iteration. @Jonathan Dickinson, please tell me if this is too much an edit for your answer, and I will make another separate answer in this case. – Pac0 Oct 02 '18 at 08:14
2

I'm going to give LINQ the benefit of the doubt and say this is probably as fast as you are going to get without writing excessive code:

var result = la.Union(ra);

EDITED: Thanks, I missed the sorted part.

You could do:

var result = la.Union(ra).OrderBy(i => i);
Kevin Kalitowski
  • 6,829
  • 4
  • 36
  • 52
  • Yes but **it does not assume the lists are sorted** and the question does ask for the 'fastest' implementation. – Jonathan Dickinson Aug 23 '11 at 17:42
  • @kd7, It will be if `ra` is(the order of `la` doesn't matter as it is buffered), but the Union method doesn't use the knowledge of la and ra being sorted. – JBSnorro Aug 23 '11 at 17:44
  • Actually @JBSnorro try running it, it won't be sorted as asked. If you just iterate through the results, with just result=la.Union(ra), with OrderBy however, it will work. – Ta01 Aug 23 '11 at 17:44
  • It's still not the fastest; "no excessive code" wasn't in the prerequisite. We are looking for what is 'algoritmitically' the fastest implementation, this type of code was a hot spot in the related question. – Jonathan Dickinson Aug 23 '11 at 17:47
  • @Kevin could you give us the big-O for space and time? – Jonathan Dickinson Aug 23 '11 at 17:56
  • @kd7, you are correct and I'm completely wrong. Sorry about that – JBSnorro Aug 23 '11 at 17:57
  • 4
    Warning, joke ahead: This is absolutely the fastest way to do it, bar none. Writing one line of code versus 50+? But you probably wanted the most performant code, I get that. – Kevin Kalitowski Aug 23 '11 at 18:02
  • I'm with @KevinKalitowski on this one – Ta01 Aug 23 '11 at 18:14
  • @Kevin - yes, this is assuming that someone needs a place where something like this is being done (due to a profiler indicating it's a hotspot). From the previous question "NOTE: I'm doing this about 5.3 million times. So every microsecond counts." – Jonathan Dickinson Aug 23 '11 at 18:15
2

This will make your UnionSorted function a little less versatile, but you can make a small improvement by making an assumption about types. If you do the comparison inside the loop itself (rather than calling the Int32Comparer) then that'll save on some function call overhead.

So your UnionSorted declaration becomes this...

static IEnumerable<int> UnionSorted(IEnumerable<int> sortedLeft, IEnumerable<int> sortedRight)

And then you do this inside the loop, getting rid of the call to comparer.Compare()...

//var comp = comparer.Compare(left, right); // too slow

int comp = 0;
if (left < right)
    comp = -1;
else if (left > right)
    comp = 1;

In my testing this was about 15% faster.

Community
  • 1
  • 1
Steve Wortham
  • 21,740
  • 5
  • 68
  • 90
  • Wow that much of a difference. I might have to accept YOUR answer if you could link to mine. – Jonathan Dickinson Aug 23 '11 at 19:20
  • Oh, is this debug or release build, and with or without the debugger? – Jonathan Dickinson Aug 23 '11 at 19:20
  • @Jonathan - This is a release build, without the debugger. I used your example arrays looped 10 million times. The performance gap may just widen with larger lists. – Steve Wortham Aug 23 '11 at 19:24
  • @Jonathan - By the way, I also tried abandoning the `IEnumerable` entirely in favor of iterating through the arrays directly but this turned out to be about 3X slower. I'd love to find out why. But your approach was clearly faster. – Steve Wortham Aug 23 '11 at 19:36
  • Array bounds checking. I think `IEnumerable` from `Array` has some magic to prevent it. You could try unsafe code. – Jonathan Dickinson Aug 23 '11 at 19:42
  • Ah, that makes some sense. I also found this little article about IEnumerable... http://weblogs.asp.net/justin_rogers/archive/2004/10/01/236837.aspx – Steve Wortham Aug 23 '11 at 19:58
0

I would solve the problem this way. (I am making an assumption which lightens the difficulty of this problem significantly, only to illustrate the idea.)

Assumption: All numbers contained in sets are non-negative.

Create a word of at least n bits, where n is the largest value you expect. (If the largest value you expect is 12, then you must create a word of 16 bits.).

Iterate through both sets. For each value, val, or the val-th bit with 1.

Once done, count the amount of bits set to 1. Create an array of that size. Go through each bit one by one, adding n to the new array if the n-th bit is set.

Zéychin
  • 4,135
  • 2
  • 28
  • 27