
I have created a class, as below, to represent a Composite Primary Key model:

public class PrimaryKeyModel
{
    public string ColumnName { get; set; }
    public string ColumnValue { get; set; }
    public int RowNumber { get; set; } // always unique
}

It basically represents the names/values of the columns which together make up the primary key, plus the row number where this combination belongs (originally in a spreadsheet).

I have then put this model in a List and populated it with data from a spreadsheet:

List<PrimaryKeyModel> primaryKeysList = new List<PrimaryKeyModel>();

I would like to check primaryKeysList to see if it has any duplicated values, and if it does, I would like to know the row numbers where those values are duplicated.

I have tried different ways, such as loading this list into a HashSet or a dictionary, and the solution at this link, but none of it worked. Is there any way I could resolve this?

Thanks.

Update: here is a sample data display. UniqueColumnsModel is the same as PrimaryKeyModel; I have changed the name here to make it clearer.

[screenshot of sample data omitted]

Edit: Clarification of the question

I am trying to import data from a spreadsheet (which can be of many types: one for sales, one for quotes, etc.) into a database. A configuration table in the database determines which column(s) in a spreadsheet constitute the primary key in the destination table. My task is to create a routine that validates spreadsheet data before it is uploaded (imported) into the database using my application. I want to validate that the columns set as the composites of the primary key do not contain any duplicated data, so that the primary key constraint is NOT violated in the destination table on insert.

The list mentioned here (of PrimaryKeyModel) contains the name of a column in the spreadsheet (which together with others constitutes the primary key), the value of that column in the spreadsheet, and the spreadsheet row number where this value exists. The list gets populated via nested foreach-row / foreach-column loops. I hope this elaborates the question better.
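For concreteness, the population loop described in the last paragraph might look like this sketch, using an in-memory stand-in for the spreadsheet; the column names (`OrderId`, `LineNo`, `Qty`) and the shape of the row data are invented for illustration, not my real configuration:

```csharp
using System.Collections.Generic;

// In-memory stand-in for the spreadsheet; column names are invented.
var rows = new List<Dictionary<string, string>>
{
    new() { ["OrderId"] = "1", ["LineNo"] = "1", ["Qty"] = "5" },
    new() { ["OrderId"] = "1", ["LineNo"] = "2", ["Qty"] = "3" },
};
string[] keyColumns = { "OrderId", "LineNo" }; // from the configuration table

var primaryKeysList = new List<PrimaryKeyModel>();
int rowNumber = 1;
foreach (var row in rows)
{
    foreach (var columnName in keyColumns)
    {
        primaryKeysList.Add(new PrimaryKeyModel
        {
            ColumnName = columnName,
            ColumnValue = row[columnName],
            RowNumber = rowNumber
        });
    }
    rowNumber++;
}
// primaryKeysList now holds 4 entries: 2 key columns x 2 rows

public class PrimaryKeyModel
{
    public string ColumnName { get; set; }
    public string ColumnValue { get; set; }
    public int RowNumber { get; set; }
}
```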

t_plusplus
  • This is a really good chance to use BinarySearch on the list, passing a custom comparer for PrimaryKeyModel. BinarySearch returns ones complement results indicating the index at which the item exists. – Haney Dec 17 '13 at 15:20
  • 1
    @DavidHaney First off, a Binary Search is for finding one item, not for looking for duplicates, second, that requires having sorted data, which doesn't appear to be the case. – Servy Dec 17 '13 at 15:21
  • @Servy right, but I meant he could use one at the time of adding to the list to check for the duplicate while adding... Single O(n) data traversal with an O(log n) binary search per item. – Haney Dec 17 '13 at 15:22
  • 1
    @DavidHaney That would have *terrible* performance. That's O(n^2) when it could be O(1) instead, in addition to being quite a bit more work. You've essentially just described an insertion sort. – Servy Dec 17 '13 at 15:24
  • 1
    I assume you're defining "duplicate" as "items with the same `ColumnName` and `ColumnValue`"? – D Stanley Dec 17 '13 at 15:26
  • Could you please define **duplicate**? Are two items duplicated iff they have the same `RowNumber` or all properties should be equal? I'm confused a bit by a comment. – Ilya Ivanov Dec 17 '13 at 15:27
  • Duplicate does not mean the column name but the value. So for instance, there can be many composite columns for a primary key. The values in these columns (columns with the given names) are what determine a duplicate or unique row. @IlyaIvanov – t_plusplus Dec 17 '13 at 15:29
  • 2
    @t_plusplus: The whole question is *really* confusing, to be honest - partly because of your choice of class name. I strongly advise you to give some sample input and expected output. – Jon Skeet Dec 17 '13 at 15:30
  • @t_plusplus Can you please add some example data to the question showing various edge cases? That will help clear up confusion a lot. – Esoteric Screen Name Dec 17 '13 at 15:30
  • @Servy what's a better alternative? – Haney Dec 17 '13 at 15:31
  • @DavidHaney Take a look at either of the answers. Those are much better. – Servy Dec 17 '13 at 15:36
  • @t_plusplus My answer should do what you asked. – Bas Dec 17 '13 at 15:45
  • @EsotericScreenName please see the sample data in the edit thanks. – t_plusplus Dec 17 '13 at 15:46
  • @BasBrekelmans Thank you, and thank you all; I am trying all. Will provide my solution in the end. Thanks again. – t_plusplus Dec 17 '13 at 15:49
  • @t_plusplus Please indicate which items in your sample data are considered duplicates. If none are, please add more data which shows duplicates. – Esoteric Screen Name Dec 17 '13 at 15:52
  • @EsotericScreenName look at my answer, this indicates what duplicates are. – Bas Dec 17 '13 at 15:54
  • @BasBrekelmans While I interpret the question the same as you do, you aren't the OP and so can't authoritatively state what constitutes a duplicate entry. Plenty of others have read the question differently, so we should get clarity from the OP in order to ensure we arrive at the correct answer. This is also an important exercise for t_plusplus in order to learn how to quickly ask definitively answerable questions in the future. – Esoteric Screen Name Dec 17 '13 at 15:55

4 Answers


GroupBy works well for this:

primaryKeysList.GroupBy(pk => new {pk.ColumnName, pk.ColumnValue})
               .Where(g => g.Count() > 1)
               .SelectMany(g => g);   // flatten the groups into a single list
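For example, run against some made-up data (a sketch; the three-argument record below is an assumption standing in for the OP's class), this returns every entry whose name/value pair occurs more than once, and each returned entry carries its RowNumber:

```csharp
using System.Collections.Generic;
using System.Linq;

var primaryKeysList = new List<PrimaryKeyModel>
{
    new("Id", "3", 1),
    new("Id", "5", 2),
    new("Id", "3", 3), // same name/value pair as row 1
};

var duplicates = primaryKeysList
    .GroupBy(pk => new { pk.ColumnName, pk.ColumnValue })
    .Where(g => g.Count() > 1)
    .SelectMany(g => g) // flatten the groups into a single list
    .ToList();

foreach (var d in duplicates)
    System.Console.WriteLine($"{d.ColumnName}={d.ColumnValue} at row {d.RowNumber}");
// Id=3 at row 1
// Id=3 at row 3

public sealed record PrimaryKeyModel(string ColumnName, string ColumnValue, int RowNumber);
```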
D Stanley

EDIT: I may well have misread the question, and inferred too much from your class name being PrimaryKeyModel - I had interpreted that to be a model for a primary key, and that you wanted to find duplicate primary keys. If that's not the case, I urge you to reconsider your naming... at that point, D Stanley's answer is probably what you want, but you should consider ColumnName/ColumnValue to be the "primary key" here - the row number is not part of the key, logically.


Original answer

You appear not to have overridden Equals(object) or GetHashCode - which means every object is considered different to every other one. You probably want something like:

public sealed class PrimaryKeyModel : IEquatable<PrimaryKeyModel>
{
    // TODO: Make these read-only (mutable keys are a bad idea...)
    public string ColumnName { get; set; }
    public string ColumnValue { get; set; }
    public int RowNumber { get; set; }

    public override bool Equals(object other)
    {
        return Equals(other as PrimaryKeyModel);
    }

    public bool Equals(PrimaryKeyModel other)
    {
        return other != null &&
               ColumnName == other.ColumnName &&
               ColumnValue == other.ColumnValue &&
               RowNumber == other.RowNumber;
    }

    public override int GetHashCode()
    {
        int hash = 23;
        hash = hash * 31 + (ColumnName == null ? 0 : ColumnName.GetHashCode());
        hash = hash * 31 + (ColumnValue == null ? 0 : ColumnValue.GetHashCode());
        hash = hash * 31 + RowNumber;
        return hash;
    }
}

This is assuming you really want all three fields to be the same - if you only care about RowNumber, you can simplify those implementations (but at that point it's an odd primary key).

After that, you can use Distinct(), or a HashSet, or a Dictionary etc. Of course, an alternative is to explicitly group by different properties - but it feels like this ought to implement equality sensibly. As noted in comments, I'd urge you to make the properties read-only though.
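As a side note, in modern C# (9 and later) a positional record generates exactly this kind of value-based `Equals`/`GetHashCode` automatically. A minimal sketch of the HashSet usage the answer mentions, with a record standing in for the hand-written class:

```csharp
using System.Collections.Generic;
using System.Linq;

var keys = new List<PrimaryKeyModel>
{
    new("Id", "3", 1),
    new("Id", "3", 1), // exact duplicate: all three fields equal
    new("Id", "3", 2), // not a duplicate: different RowNumber
};

// HashSet<T>.Add returns false when an equal item is already present,
// so the entries it rejects are exactly the duplicates.
var seen = new HashSet<PrimaryKeyModel>();
var duplicates = keys.Where(k => !seen.Add(k)).ToList();
// duplicates holds the single repeated ("Id", "3", 1) entry

// Records compare by value over their positional properties,
// mirroring the hand-written Equals/GetHashCode above.
public sealed record PrimaryKeyModel(string ColumnName, string ColumnValue, int RowNumber);
```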

Jon Skeet
  • 1
    Great implementation of `Equals`/`GetHashCode` but the question asks to find duplicate _name/value_ pais and the _row numbers_ of those duplicates. – D Stanley Dec 17 '13 at 15:24
  • @DStanley: Hmm - I think I see, in which case `PrimaryKeyModel` is a *horrible* name for the class. It's still unclear though - will edit. – Jon Skeet Dec 17 '13 at 15:27

If your class represents this kind of structure:

ColumnName    ColumnValue   RowNumber
Id            3             1
Id2           1             1 
Id            1             2 
Id2           2             2
Id            3             3 
Id2           1             3 //duplicate

Then all other answers so far are incorrect and you need to do it differently: group by row number and then compare each field one by one. Because equality is symmetric, we can speed up the loop slightly so we don't compare each pair of rows twice.

List<PrimaryKeyModel> keys = new List<PrimaryKeyModel>()
{
        new PrimaryKeyModel("Id", "3", 1),
        new PrimaryKeyModel("Id2", "1", 1),
        new PrimaryKeyModel("Id", "1", 2),
        new PrimaryKeyModel("Id2", "1", 2),
        new PrimaryKeyModel("Id", "3", 3),
        new PrimaryKeyModel("Id2", "1", 3),
};

var groupedKeys = keys.OrderBy(pk => pk.ColumnName).GroupBy(k => k.RowNumber).ToList();
HashSet<int> duplicateRowNumbers = new HashSet<int>();

for (int i = 0; i < groupedKeys.Count - 1; i++)
{
    for (int j = i + 1; j < groupedKeys.Count; j++)
    {
        if (AreTheSame(groupedKeys[i], groupedKeys[j]))
        {
            duplicateRowNumbers.Add(groupedKeys[i].First().RowNumber);
            duplicateRowNumbers.Add(groupedKeys[j].First().RowNumber);
        }
    }
}

private static bool AreTheSame(IEnumerable<PrimaryKeyModel> a, IEnumerable<PrimaryKeyModel> b)
{
    var leftEnumerator = a.GetEnumerator();
    var rightEnumerator = b.GetEnumerator();
    while (leftEnumerator.MoveNext() | rightEnumerator.MoveNext()) // non-short-circuiting '|': advance both enumerators
    {
        if (leftEnumerator.Current == null) return false;
        if (rightEnumerator.Current == null) return false;
        if (leftEnumerator.Current.ColumnValue != rightEnumerator.Current.ColumnValue) return false;
    }

    return true;
}
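For reference, the same per-row comparison can also be sketched with LINQ alone: build an order-independent signature per row, then group rows by it. This is a sketch with an assumed positional record, not the answer's exact code:

```csharp
using System.Collections.Generic;
using System.Linq;

var keys = new List<PrimaryKeyModel>
{
    new("Id", "3", 1), new("Id2", "1", 1),
    new("Id", "1", 2), new("Id2", "1", 2),
    new("Id", "3", 3), new("Id2", "1", 3),
};

var duplicateRows = keys
    .GroupBy(k => k.RowNumber)
    .Select(g => new
    {
        Row = g.Key,
        // sort by column name so the signature is stable across rows
        Signature = string.Join("|", g.OrderBy(k => k.ColumnName)
                                      .Select(k => k.ColumnName + "=" + k.ColumnValue))
    })
    .GroupBy(x => x.Signature)
    .Where(sig => sig.Count() > 1)
    .SelectMany(sig => sig.Select(x => x.Row))
    .OrderBy(r => r)
    .ToList();
// duplicateRows is [1, 3]: rows 1 and 3 share the key (Id=3, Id2=1)

public sealed record PrimaryKeyModel(string ColumnName, string ColumnValue, int RowNumber);
```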
Bas
  • Upon further testing with different patterns of data, I found that this solution fails when a 'combination' of columns have duplicate values. It may work with some adjustment as it presently tests each column individually rather than collectively. for each iteration in a row of data all the columns together MUST give a unique value. Thanks anyway, I have put my solution below. @Bas Brekelmans – t_plusplus Dec 18 '13 at 12:43
  • I think there might be a problem with your explanation, my solution sorts the columns by name and only returns true if all pairs of column values in the two row numbers have exactly the same value. @t_plusplus – Bas Dec 18 '13 at 12:55
  • Thank you. I will check it again and see why it failed with some pattern of data. @Bas Brekelmans – t_plusplus Dec 19 '13 at 09:28

This was the final solution that worked for me. It ensures that no duplicate composite keys exist across the rows of the list (i.e., list of lists). It basically pours each row's concatenated key into a HashSet, whose Add method returns false if the newly added item already exists in the set:

Thanks for all those who contributed to solving this above!

HashSet<string> primaryKeyChecker = new HashSet<string>();

foreach (var row in rows)
{
    StringBuilder primaryKey = new StringBuilder();
    // get the row number here

    foreach (var column in columns)
    {
        // if this column is a composite of the primary key:
        //     get the column value and append it to the
        //     StringBuilder to form the primary key
    }

    bool added = primaryKeyChecker.Add(primaryKey.ToString());

    if (!added)
    {
        // Report a duplicate record and give the row number where this occurred.
    }
}

Update

To avoid the issue highlighted by @Bas below, just make sure that when you concatenate the primary key parts you separate them with a delimiter such as a comma, so that the highlighted scenario won't occur. So do something like this:

  primaryKey.Append(currentValue + ",");
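A quick sketch of why the delimiter matters, using the values from @Bas's comment below:

```csharp
// Without a delimiter, the distinct composite keys ("21","2") and
// ("2","12") concatenate to the same string and collide:
string noDelimiterA = "21" + "2";   // "212"
string noDelimiterB = "2" + "12";   // "212" -> false duplicate

// With a delimiter they stay distinct:
string delimitedA = "21" + "," + "2";   // "21,2"
string delimitedB = "2" + "," + "12";   // "2,12"

System.Console.WriteLine(noDelimiterA == noDelimiterB); // True
System.Console.WriteLine(delimitedA == delimitedB);     // False
```

This assumes the delimiter never appears inside the column values themselves; if it can, escape it or pick a separator that cannot occur in the data.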
t_plusplus
  • this will break with input (21, 2) and (2, 12), because you concatenate strings and will report a duplicate. – Bas Dec 19 '13 at 08:13
  • Thanks for highlighting this issue to me, I put the resolution for it in the Update above @Bas Brekelmans – t_plusplus Dec 20 '13 at 10:00