11

I have a List<CustomPoint> points; which contains close to million objects. From this list I would like to get the List of objects that are occuring exactly twice. What would be the fastest way to do this? I would also be interested in a non-Linq option also since I might have to do this in C++ also.

public class CustomPoint
{
    public double X { get; set; }
    public double Y { get; set; }

    public CustomPoint(double x, double y)
    {
        this.X = x;
        this.Y = y;
    }
}

public class PointComparer : IEqualityComparer<CustomPoint>
{
    public bool Equals(CustomPoint x, CustomPoint y)
    {
        return ((x.X == y.X) && (y.Y == x.Y));
    }

    public int GetHashCode(CustomPoint obj)
    {
        int hash = 0;
        hash ^= obj.X.GetHashCode();
        hash ^= obj.Y.GetHashCode();
        return hash;
    }
}

based on this answer, i tried,

list.GroupBy(x => x).Where(x => x.Count() = 2).Select(x => x.Key).ToList(); 

but this is giving zero objects in the new list. Can someone guide me on this?

Community
  • 1
  • 1
vinayan
  • 1,597
  • 3
  • 19
  • 42
  • Where do you get the data from? – Mike Perrenoud Dec 20 '12 at 12:02
  • @MichaelPerrenoud - these are geographical latitude and longitude values – vinayan Dec 20 '12 at 12:05
  • Yes, but where is the data being loaded from (e.g. SQL Server Database)? I'm asking because you mentioned it was close to a million objects and also that you may need to access it in C++ – Mike Perrenoud Dec 20 '12 at 12:37
  • @MichaelPerrenoud - ok..I am reading these points from a shapefile(http://en.wikipedia.org/wiki/Shapefile). I am using an application called Quantum GIS and its C++ API.. – vinayan Dec 20 '12 at 12:55
  • 2
    You may or may not be aware of this, but comparing doubles for equality is usually not a good idea. See http://stackoverflow.com/questions/1398753/comparing-double-values-in-c-sharp – Sebastian Negraszus Dec 20 '12 at 13:27
  • Please choose the best answer!! – Saw Dec 27 '12 at 18:42
  • all the answers were equally good and iformative...unfortunately only one can be marked as answer – vinayan Dec 28 '12 at 03:41

3 Answers3

9

You should implement Equals and GetHashCode in the class itself and not in the PointComparer

Saw
  • 6,199
  • 11
  • 53
  • 104
  • i implemented IEqualityComparer for the point class..it didn't work..i must be doing something awfully wrong – vinayan Dec 20 '12 at 12:23
  • No, it is Ok, but you can use in another way, but if you want to use equals and gethashcode methods, you have to use them on the main class and not on the IEqualityComparer – Saw Dec 20 '12 at 12:25
4

To get your code working, you need to pass an instance of your PointComparer as a second argument to GroupBy.

Rawling
  • 49,248
  • 7
  • 89
  • 127
  • i tried points.GroupBy(new PointComparer()).Where(g => g.Count() == 2).Select(g => g.Key).ToList(); it doesn't compile – vinayan Dec 20 '12 at 12:10
  • Sorry, you need both the lambda _and_ the `new PointComparer()` (two parameters). – Rawling Dec 20 '12 at 12:11
3

This method works for me:

public class PointCount
{
    public CustomPoint Point { get; set; }
    public int Count { get; set; }
}

private static IEnumerable<CustomPoint> GetPointsByCount(Dictionary<int, PointCount> pointcount, int count)
{
    return pointcount
                    .Where(p => p.Value.Count == count)
                    .Select(p => p.Value.Point);
}

private static Dictionary<int, PointCount> GetPointCount(List<CustomPoint> pointList)
{
    var allPoints = new Dictionary<int, PointCount>();

    foreach (var point in pointList)
    {
        int hash = point.GetHashCode();

        if (allPoints.ContainsKey(hash))
        {
            allPoints[hash].Count++;
        }
        else
        {
            allPoints.Add(hash, new PointCount { Point = point, Count = 1 });
        }
    }

    return allPoints;
}

Called like this:

static void Main(string[] args)
{
    List<CustomPoint> list1 = CreateCustomPointList();

    var doubles = GetPointsByCount(GetPointCount(list1), 2);

    Console.WriteLine("Doubles:");
    foreach (var point in doubles)
    {
        Console.WriteLine("X: {0}, Y: {1}", point.X, point.Y);
    }
}

private static List<CustomPoint> CreateCustomPointList()
{
    var result = new List<CustomPoint>();

    for (int i = 0; i < 5; i++)
    {
        for (int j = 0; j < 5; j++)
        {
            result.Add(new CustomPoint(i, j));
        }
    }

    result.Add(new CustomPoint(1, 3));
    result.Add(new CustomPoint(3, 3));
    result.Add(new CustomPoint(0, 2));

    return result;
}

CustomPoint implementation:

public class CustomPoint
{
    public double X { get; set; }
    public double Y { get; set; }

    public CustomPoint(double x, double y)
    {
        this.X = x;
        this.Y = y;
    }

    public override bool Equals(object obj)
    {
        var other = obj as CustomPoint;

        if (other == null)
        {
            return base.Equals(obj);
        }

        return ((this.X == other.X) && (this.Y == other.Y));
    }

    public override int GetHashCode()
    {
        int hash = 23;
        hash = hash * 31 + this.X.GetHashCode();
        hash = hash * 31 + this.Y.GetHashCode();
        return hash;
    }
}

It prints:

Doubles:
X: 0, Y: 2
X: 1, Y: 3
X: 3, Y: 3

As you see in GetPointCount(), I create a dictionary per unique CustomPoint (by hash). Then I insert a PointCount object containing a reference to the CustomPoint which starts at a Count of 1, and every time the same point is encountered, the Count is increased.

Finally in GetPointsByCount I return the CustomPoints in the dictionary where PointCount.Count == count, in your case 2.

Please also note I updated the GetHashCode() method, since your one returns the same for point (1,2) and (2,1). If you do want that, feel free to restore your own hashing method. You will have to test the hashing function though, because it's hard to uniquely hash two numbers into one. That depends on the range of numbers used though, so you should implement a hash function that fits your own needs.

CodeCaster
  • 147,647
  • 23
  • 218
  • 272
  • This is the same of my answer, but with a code, it doesn't need code, he can write the code very easily, thenx for your time – Saw Dec 20 '12 at 12:29
  • 3
    @MSakherSawan _"This is the same of my answer"_ - no it is not, I implemented the `GetDoubles()` method which, contrary to your answer, does answer the question. – CodeCaster Dec 20 '12 at 12:30
  • +1 for the effort, GJ :) But I'd also recommend to use a PLINQ with divide and conquer for some extra speed (I'm assuming speed is relevant in this case). – flindeberg Dec 20 '12 at 12:55
  • 2
    @flindeberg that is true, but you'll have to synchronize the dictionary access to prevent concurrency problems. Don't know if you'll gain much benefit from that. – CodeCaster Dec 20 '12 at 13:08
  • 1
    @CodeCaster I was thinking more in terms of map-reduce :) But it of course depends on the composition of the original data, if there are alot of similar points map-reduce would really help, but if there isn't it really wouldn't. – flindeberg Dec 20 '12 at 13:19
  • thanks for the detailed answer. Are u sure that (1,2) and (2,1) would return the same hash? I am getting 45653674 and 41149443 respectively.. – vinayan Dec 20 '12 at 16:37
  • 1
    @vinayan CodeCaster's implementation of `GetHashCode` _deliberately_ avoids doing that because usually it's a bad idea; if you _want_ it to be the case then simply replace his `GetHashCode` with the one from your question. – Rawling Dec 21 '12 at 08:09
  • @Rawling - I want (1,2) and (2,1) to be different..but with my existing GetHashCode, they already give different hashes.. – vinayan Dec 21 '12 at 11:27