How can I decrease the amount of memory and operations involved in the ""Find all the pairs that add to N" interview challenge?

Question

I was asked in an interview to write a function for finding all pairs of ints in an array that add up to N. My answer was kinda bulky:

HashSet<Tuple<int,int>> PairsThatSumToN ( int [] arr, int N )
{
    HashSet<int> arrhash = new HashShet<int> (arr);
    HashSet<Tuple<int,int>> result = new HashSet<Tuple<int,int>>();    
    for ( int i in arrhash ) 
    {
       int j = N - i;
       if ( arrhash.Contains(j) ) result.Add(new Tuple<int,int> (i,j));
    }
    return result;
}

I'm a beginner to C#, come from a C++ background, and I have a few questions about how to make this better:

Is it innefficient to iterate through a HashSet? In other words, would my procedure be more efficient (although less compact) if I changed it to

HashSet<Tuple<int,int>> PairsThatSumToN ( int [] arr, int N )
{
   HashSet<int> arrhash = new HashShet<int> ();
   HashSet<Tuple<int,int>> result = new HashSet<Tuple<int,int>>();    
   for ( int i in arr ) 
   {
      int j = N - i;
      if ( arrhash.Contains(j) ) result.Add(new Type<int,int> (i,j));
      arrHash.Add(i);
   }
   return result;
}

?????

I realize that Add is more like an "Add if not already in there", so I have a useless operation whenever I run result.Add(new Tuple<int,int> (i,j)) for an i,j pair that is already in the set. The more repeated pairs in the array, the more useless operations, and there's all the overhead of allocating the new Tuple that may never be used. Is there a way to optimize this by checking whether the pair i,j is a Tuple in the set before creating a new Tuple out of said pair and trying to add it?
Speaking of the above allocation of a new Tuple on the heap, do I need to free this memory if I don't end up adding that Tuple to the result? Potential memory leak here?
There has to be some way of combining the two sets
```
HashSet<int> arrhash = new HashShet<int> (arr);
HashSet<Tuple<int,int>> result = new HashSet<Tuple<int,int>>(); 
```
In a sense, they contain redundant information since every int in the second one is also in the first one. Something feels "wrong" about having to sets here, yet I can't think of a better way to do it.
Better yet, does the .NET library have any way of doing a 1-line solution for the problem? ;)

Paging Dr. Skeet.

I think I would sort the array and then come at it from both ends. — paparazzo, Nov 20 '15 at 07:33
@Frisbee: unfortunately, sorting is an O(N Log(N)) operation. — , Nov 20 '15 at 07:41
@user5572578 Your second implementation beats the first by 1. returning only unique pairs (the first will have duplicates), and 2. being 10% faster — Oguz Ozgul, Nov 20 '15 at 07:51
@Frisbee: in theory, not. In practice, will probably depend on N and on the particular data set. — , Nov 20 '15 at 07:53
In theory, theory and practice are the same. In practice, they are not. (Albert Einstein) — Oguz Ozgul, Nov 20 '15 at 07:56
HashSet> is poor choice. Tuple does not hash nicely. And add is O(n). — paparazzo, Nov 20 '15 at 08:06

Oguz Ozgul · Answer 1 · 2015-11-20T08:51:28.063

2

If you need a neat solution for your problem, here it is, implemented with LINQ.

The performance however, is 4 times worse than your second solution.

Since you have asked for a one liner, here it is anyway.

NOTE: I would appreciate any improvements especially to get rid of that Distinct() since it takes the 50% of the overall cpu time

static List<Pair> PairsThatSumToN(int[] arr, int N)
{
    return
    (
        from x in arr join y in arr on N - x equals y select new Pair(x, y)
    )
    .Distinct()
    .ToList();
}

public class Pair : Tuple<int, int>
{
    public Pair(int item1, int item2) : base(item1, item2) { }

    public override bool Equals(object pair)
    {
        Pair dest = pair as Pair;
        return dest.Item1 == Item1 || dest.Item2 == Item1;
    }

    public override int GetHashCode()
    {
        return Item1 + Item2;
    }
}

edited Nov 20 '15 at 08:51

answered Nov 20 '15 at 08:42

Oguz Ozgul

6,809
1
14
26

1

i feel so stupid when i saw `return Item1 + Item2;` xD. nice. – M.kazem Akhgary Nov 20 '15 at 08:47
:) Don't feel that way. – Oguz Ozgul Nov 20 '15 at 08:48
1

And now with the Pair class it's no more one liner :-) – Ivan Stoev Nov 20 '15 at 11:20
1

@ivan-stoev It is, if we keep that Pair class out of sight in it's own file :-) – Oguz Ozgul Nov 20 '15 at 13:36
1

You got me - LOL. One liner two file-r :-) – Ivan Stoev Nov 20 '15 at 13:39
:-) The requirement did not mention anything about the number of files so it is out of scope :-) – Oguz Ozgul Nov 20 '15 at 13:41
Take a look at your `GetHashCode` method. For this special use case all of your pairs having the same hash code (the value of N). Maybe you can increase performance if you take one which is better hashing your pairs. Or you design your `Pair` class to store the items in a specific order e.g. take that constructor: `public Pair(int item1, int item2) : base(Math.Max(item1,item2), Math.Min(item1,item2) { }` and do not override `Equals` nor `GetHashCode`. – abto Nov 23 '15 at 06:59

paparazzo · Accepted Answer · 2015-11-21T00:22:19.030

2

This is what I would try

    public Dictionary<int, int> Pairs(int[] arr, int N)
    {
        // int N asssumes no arr > int32 max / 2 
        int len = arr.Length < N ? arr.Length / 2 : N / 2;
        Dictionary<int, int> d = new Dictionary<int, int>(len); 
                          // add is O(1) if count <= capacity 
        if(arr.Length == 0) return d;
        Array.Sort(arr);  // so it is O(n log n) I still take my chances with it
                          // that is n * log(n)         
        int start = 0;
        int end = arr.Length - 1;
        do
        {
            int ttl = arr[start] + arr[end];
            if (ttl == N)
            {
                if(!d.ContainsKey(arr[start]))
                      d.Add(arr[start], arr[end]); 
                      // if start <= end then pair uniquely defined by either 
                      // and a perfect hash (zero collisions)
                start++;
                end--;
            }
            else if (ttl > N)
                end--;
            else
                start++;
            if(start >= end)
                return d;
        }   while (true);
    }

Even with a HashSet based solution still use Dictionary(N/2) with Key <= Value

Or use Dictionary(arr.Length / 2)

edited Nov 21 '15 at 00:22

answered Nov 20 '15 at 09:01

paparazzo

44,497
23
105
176

I don't think your procedure works if `arr` is empty – user5572578 Nov 20 '15 at 14:46
@user5572578 Yes it would fail on empty. Did you actually test this solution? Is it slower than the accepted solution? – paparazzo Nov 20 '15 at 14:53
I tested and compared the original algorithm from the OP and all of the proposed solutions including my own for cpu times with the same amount of data and this one is the best by far. Almost 7 times faster than the original (2nd one) which is the second best – Oguz Ozgul Nov 20 '15 at 21:54
@OguzOzgul I guess the OP just dismissed it because it did deal with arr is empty. 7 times as fast with 1 up vote and not the accepted answer. – paparazzo Nov 20 '15 at 21:59
Up vote is mine by the way, voted as soon as I did the test. Very nice approach there. You must be a math person. – Oguz Ozgul Nov 20 '15 at 22:04
@OguzOzgul Yes I am an applied mathematician. Way to often people take O too seriously. Is the n 1 ms or 1 second. – paparazzo Nov 20 '15 at 22:11
+1 for the classic algorithm. Unfortunately improper implementation. First, returning a Dictionary is a strange choice and is causing exception when the input set contains duplicates. Second (after modifying it to return List>), it neither returns unique nor all pairs. – Ivan Stoev Nov 20 '15 at 23:50
@IvanStoev Dictionary is strange? "if start <= end then pair uniquely defined by either". If the total is N then the pair is uniquely defined by either the smaller or greater. From that you get a size of /2 and that makes it faster. As for duplicate then if contains key continue - I will add that. – paparazzo Nov 21 '15 at 00:36
Dictionary is for searching, which is not the goal for the expected result. Also the result is supposed to be a pair, not KeyValuePair - which one is a Key, which one - a Value and why? The function should simple yield pairs and let the caller decide whether he wants list, array, dictionary etc. Even with the latest addition the function is incorrect. In the case of `ttl == N` it should either count equal values at both ends and emit `startCount * endCount` times the found pair (in case we want all), or emit the found pair and skip equal values at both ends (in case we want unique only). – Ivan Stoev Nov 21 '15 at 00:51
@IvanStoev KVP is a pair. Two number that add up to N then either the smaller or larger unique identifies the pair. I don't follow and this is not productive. – paparazzo Nov 21 '15 at 01:12
@Frisbee As I stated at the beginning, you have my upvote, so all I wrote was not supposed to be offensive. Just a comments. So let stop here. Cheers. – Ivan Stoev Nov 21 '15 at 01:16

M.kazem Akhgary · Answer 3 · 2015-11-20T08:15:45.630

1

First of all HashSet removes duplicate items. So iterating through HashSet or Array may yield different results since the array may have duplicate items.

Iterating through HashSet is ok. but note that it should not be used for only iterating purpose. BTW using HashSet is best option here because of O(1) for finding items.

~~Tuples are compared by reference inside HashSet. That means two different tuples with same items are never equal by default. since they always have different reference.~~ (Sorry my mistake.) it seems tuples are compared by their items. But it compares only x.item1 to y.item1 and x.item2 to y.item2. so 1,2 and 2,1 are not equal. you can make them equal by setting another IEqualityComparer to hashset.

You should not be worry about memory leaks. when HashSet fails to add tuple the garbage collector will remove that tuple when the reference of that tuple is gone. Not immediately but when its needed.

static HashSet<Tuple<int, int>> PairsThatSumToN(int[] arr, int N)
{
    HashSet<int> hash = new HashSet<int>(arr);
    HashSet<Tuple<int, int>> result = new HashSet<Tuple<int, int>>(new IntTupleComparer());

    foreach(int i in arr)
    {
        int j = N - i;
        if (hash.Contains(j)) result.Add(new Tuple<int, int>(i, j));
    }
    return result;
}

public class IntTupleComparer : IEqualityComparer<Tuple<int, int>>
{
    public bool Equals(Tuple<int, int> x, Tuple<int, int> y)
    {
        return (x.Item1 == y.Item1 && x.Item2 == y.Item2) || (x.Item1 == y.Item2 && x.Item2 == y.Item1);
    }

    public int GetHashCode(Tuple<int, int> obj)
    {
        return (obj.Item1 + obj.Item2).GetHashCode();
    }
}

edited Nov 20 '15 at 08:15

answered Nov 20 '15 at 07:39

M.kazem Akhgary

18,645
8
57
118

Have you tested this? It finds 929 pairs for an int[] with values 0 - 999 and N = 141 and requires 10 times more cpu, and worst of all, it returns pairs with values not inside the array. You should fix it (it returns 0, -141 for example) – Oguz Ozgul Nov 20 '15 at 07:44
@OguzOzgul fixed it. the problem was the condition. `j!=0` – M.kazem Akhgary Nov 20 '15 at 07:47
1

Please remember that, when trying to achieve O(1), you are ignoring the fact that j should also be a member of the array – Oguz Ozgul Nov 20 '15 at 07:48
71 pairs should be returned for array 0 - 999 and N = 141 – Oguz Ozgul Nov 20 '15 at 07:49
Ok, now it returns unique pairs, but requires 2x cpu time – Oguz Ozgul Nov 20 '15 at 07:53
@OguzOzgul oh my god. yeah you are right. i did a mistake. j should also be a member of the array. so i should include this check. – M.kazem Akhgary Nov 20 '15 at 07:53
@OguzOzgul why do you assume it requires 2x cpu time? (note that tuples should be compared by their items not reference.) – M.kazem Akhgary Nov 20 '15 at 07:59
I am not assuming. I measured it. – Oguz Ozgul Nov 20 '15 at 08:09
Thats because of extra check `(x.Item1 == y.Item2 && x.Item2 == y.Item1)`. so for example `1,2` and `2,1` would be equal. @OguzOzgul – M.kazem Akhgary Nov 20 '15 at 08:17
Your `GetHashCode` method always deliver the same hash for the elements in question (the hash of N). Maybe you can increase performance if you redesign it , so that it delivers different hashes for different pairs (something like `(Math.Max(obj.item1,obj.item2) ^397)*Math.Min(obj.item1,item2)`). – abto Nov 23 '15 at 07:04

score 1 · Answer 4 · answered Nov 20 '15 at 12:30

If the input set contains unique numbers, or the function must return only unique pairs, I think your second algorithm is the best. Just the result doesn't need to be a HashSet<Tuple<int, int>>, because the uniqueness is guaranteed by the algorithm - a simple List<Tuple<int, int>> would do the same, and better abstraction would be IEnumerable<Tuple<int, int>>. Here is how it looks implemented with C# iterator function:

static IEnumerable<Tuple<int, int>> UniquePairsThatSumToN(int[] source, int N)
{
    var set = new HashSet<int>();
    for (int i = 0; i < source.Length; i++)
    {
        var a = source[i];
        var b = N - a;
        if (set.Add(a) && set.Contains(b))
            yield return Tuple.Create(b, a);
    }
}

The key point is the line if (set.Add(a) && set.Contains(b)). Since both HashSet<T>.Add and HashSet<T>.Contains are O(1), the whole algorithm is therefore O(N).

With a relatively small modification we can make a function that returns all pairs (not only unique) like this

static IEnumerable<Tuple<int, int>> AllPairsThatSumToN(int[] source, int N)
{
    var countMap = new Dictionary<int, int>(source.Length);
    for (int i = 0; i < source.Length; i++)
    {
        var a = source[i];
        var b = N - a;
        int countA;
        countMap.TryGetValue(a, out countA);
        countMap[a] = ++countA;
        int countB;
        if (countMap.TryGetValue(b, out countB))
            while (--countB >= 0)
                yield return Tuple.Create(b, a);
    }
}

@Frisbee I would have done that with pleasure (as with Dictionary in my second snippet), however unfortunately MS haven't provided a HashSet constructor accepting the capacity. — Ivan Stoev, Nov 20 '15 at 22:31
Bummer as then the first is O(n * n) as Add is O(n) if count < capacity. https://msdn.microsoft.com/en-us/library/bb353005.aspx — paparazzo, Nov 21 '15 at 00:28
@Frisbee Well, not really because the size is doubled, so in the worst case rehashing happens Log2(N) times. — Ivan Stoev, Nov 21 '15 at 00:59

How can I decrease the amount of memory and operations involved in the ""Find all the pairs that add to N" interview challenge?

4 Answers4