Determining duplicates in a datatable

Question

I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id and owner_org_id) in the datatable. Once I've determined that, I can use that information to build my result, which is a datatable containing only the rows that are not unique, and a data table containing only the rows that are unique.

I've looked at other examples on here and the code I've come up with so far does compile and execute, but it seems to think every row in the data is unique. In reality in the test data there's 13 rows and only 6 are unique. So clearly I'm doing something wrong.

EDIT: Thought I should note, rows that have duplicates should ALL be removed, not just the duplicates of that row. eg if there are 4 duplicates, all 4 should be removed not 3, leaving one unique row from the 4.

EDIT2: Alternatively, if I can select all duplicate rows (instead of trying to select unique rows) it is fine with me. Either way can get me to my end result.

The code in the processing method:

MyRowComparer myrc = new MyRowComparer();
var uniquerows = dtCSV.AsEnumerable().Distinct(myrc);

along with the following:

public class MyRowComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        //return ((string.Compare(x.Field<string>("PRODUCT_ID"),   y.Field<string>("PRODUCT_ID"),   true)) ==
        //        (string.Compare(x.Field<string>("OWNER_ORG_ID"), y.Field<string>("OWNER_ORG_ID"), true)));
        return
            x.ItemArray.Except(new object[] { x[x.Table.Columns["PRODUCT_ID"].ColumnName] }) ==
            y.ItemArray.Except(new object[] { y[y.Table.Columns["PRODUCT_ID"].ColumnName] }) &&
            x.ItemArray.Except(new object[] { x[x.Table.Columns["OWNER_ORG_ID"].ColumnName] }) ==
            y.ItemArray.Except(new object[] { y[y.Table.Columns["OWNER_ORG_ID"].ColumnName] });
    }

    public int GetHashCode(DataRow obj)
    {
        int y = int.Parse(obj.Field<string>("PRODUCT_ID"));
        int z = int.Parse(obj.Field<string>("OWNER_ORG_ID"));
        int c = y ^ z;
        return c;
    }
}

I don't understand why you are using `Except` - why aren't you just comparing the values for the 2 columns that matter? Also `x.Table.Columns["PRODUCT_ID"].ColumnName` should be identical to "PRODUCT_ID" by definition, so you can skip the column lookup. — PinnyM, May 22 '12 at 17:36

Tim Schmelter · Accepted Answer · 2012-05-22T19:04:29.807

You could use LINQ-To-DataSet and Enumerable.Except/Intersect:

var tbl1ID = tbl1.AsEnumerable()
        .Select(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id"),
        });
var tbl2ID = tbl2.AsEnumerable()
        .Select(r => new
        {
            product_id = r.Field<String>("product_id"),
            owner_org_id = r.Field<String>("owner_org_id"),
        });


var unique = tbl1ID.Except(tbl2ID);
var both = tbl1ID.Intersect(tbl2ID);

var tblUnique = (from uniqueRow in unique
                join row in tbl1.AsEnumerable()
                on uniqueRow equals new
                {
                    product_id = row.Field<String>("product_id"),
                    owner_org_id = row.Field<String>("owner_org_id")
                }
                select row).CopyToDataTable();
var tblBoth = (from bothRow in both
              join row in tbl1.AsEnumerable()
              on bothRow equals new
              {
                  product_id = row.Field<String>("product_id"),
                  owner_org_id = row.Field<String>("owner_org_id")
              }
              select row).CopyToDataTable();

Edit: Obviously i've misunderstood your requirement a little bit. So you only have one DataTable and want to get all unique and all duplicate rows, that's even more straight-forward. You can use Enumerable.GroupBy with an anonymous type containing both fields:

var groups = tbl1.AsEnumerable()
    .GroupBy(r => new
    {
        product_id = r.Field<String>("product_id"),
        owner_org_id = r.Field<String>("owner_org_id")
    });
var tblUniques = groups
    .Where(grp => grp.Count() == 1)
    .Select(grp => grp.Single())
    .CopyToDataTable();
var tblDuplicates = groups
    .Where(grp => grp.Count() > 1)
    .SelectMany(grp => grp)
    .CopyToDataTable();

At the beginning I only have one datatable, with everything in it. This example seems to assume I already have everything split into two datasets, or am I missing something? — user1366062, May 22 '12 at 18:30
@user1366062: Then i've misunderstood your requirement a litle bit. Edited my answer. — Tim Schmelter, May 22 '12 at 18:44

score 1 · Answer 2 · edited May 23 '17 at 10:27

Your criterium is off. You are comparing sets of objects that you are not interested (Except excludes) in.

Instead, be as clear (data type) as possible and keep it simple:

public bool Equals(DataRow x, DataRow y)
{   
    // Usually you are dealing with INT keys
    return (x["PRODUCT_ID"] as int?) == (y["PRODUCT_ID"] as int?)
      && (x["OWNER_ORG_ID"] as int?) == (y["OWNER_ORG_ID"] as int?);

    // If you really are dealing with strings, this is the equivalent:
    // return (x["PRODUCT_ID"] as string) == (y["PRODUCT_ID"] as string)
    //  && (x["OWNER_ORG_ID"] as string) == (y["OWNER_ORG_ID"] as string)
}

Check for null if that is a possibility. Maybe you want to exclude rows that are equal because their IDs are null.

Observe the int?. This is not a typo. The question mark is required if you are dealing with database values from columns that can be NULL. The reason is that NULL values will be represented by the type DBNull in C#. Using the as operator just gives you null in this case (instead of an InvalidCastException. If you are sure, you are dealing with INT NOT NULL, cast with (int).

The same is true for strings. (string) asserts you are expecting non-null DB values.

EDIT1:

Had the type wrong. ItemArray is not a hashtable. Use the row directly.

EDIT2:

Added string example, some comment

For a more straight-forward way, check How to select distinct rows in a datatable and store into an array

EDIT3:

Some explanation regarding the casts.

The other link I suggested does the same as your code. I forgot your original intent ;-) I just saw your code and responded to the most obvious error, I saw - sorry

Here is how I would solve the problem

using System.Linq;
using System.Data.Linq;

var q = dtCSV
    .AsEnumerable()
    .GroupBy(r => new { ProductId = (int)r["PRODUCT_ID"], OwnerOrgId = (int)r["OWNER_ORG_ID"] })
    .Where(g => g.Count() > 1).SelectMany(g => g);

var duplicateRows = q.ToList();

I don't know if this 100% correct, I don't have an IDE at hand. And you'll need to adjust the casts to the appropriate type. See my addition above.

This appears to remove all of the duplicates except for one per group of duplicates. for example if I have 5 rows all with matching product_id and owner_org_id, I get 1 row with those matching product_id and owner_org_id. I would like to also remove that final row. — user1366062, May 22 '12 at 18:24
Additionally, I could not use the 'as int' even though they are int as I get an error on the column's name that says 'the as operator must be used with a reference type or nullable type'. I checked out your link for the more straightforward way, which seemed to partially work, however I had trouble having it return an entire row and not just the two columns in question. I need the entire row but only comparing the two columns for duplicates. I could have done it wrong, but the example seemed fairly clear. — user1366062, May 22 '12 at 18:28
@user1366062: I responded to both of your comments in my answer. There is a part about the casts in the middle and an addition at the end — skarmats, May 22 '12 at 18:50
Just saw the edit on the other answer. Take his code for a more complete solution. He takes the resulting rows and feeds them back into a `DataTable` which might be what you want. — skarmats, May 22 '12 at 18:52

Determining duplicates in a datatable

2 Answers2

Linked

Related