5

I have a class called Customer that has several string properties such as firstName, lastName, email, etc.

I read the customer information from a CSV file into an array of that class: Customer[] customers

I need to remove the duplicate customers having the same email address, leaving only 1 customer record for each particular email address.

I have done this using 2 loops but it takes nearly 5 minutes as there are usually 50,000+ customer records. Once I am done removing the duplicates, I need to write the customer information to another csv file (no help needed here).

If I did a Distinct in a loop how would I remove the other string variables that are a part of the class for that particular customer as well?

Thanks, Andrew

AWooster
  • Is the idea to run this daily/weekly/quarterly? Frequency of this task will likely dictate the permanence of a solution. – mjw Dec 07 '15 at 20:25
  • Distinct will not work for custom types without a custom equality comparer. Use DistinctBy from MoreLinq. By the way, this operation should not take long for 50k items, since Distinct is `O(n)`. – M.kazem Akhgary Dec 07 '15 at 20:25
  • My choice would probably be to sort the input file by duplicate key (email in your case) and do a simple previous to current value comparison before adding to your object. – mjw Dec 07 '15 at 20:27
  • I'd use a `KeyedCollection` (in `System.Collections.ObjectModel`). Let the eMail be the key and insert after checking with `Contains`. This is very fast... – Shnugo Dec 07 '15 at 20:28
  • Possibly related / helpful: http://stackoverflow.com/questions/2537823/distinct-by-property-of-class-by-linq – joranvar Dec 07 '15 at 20:32
  • What do you mean "I have done this using 2 loops" ? – Mike Nakis Dec 07 '15 at 20:34
  • When you find a duplicate, how will you decide which record to keep, bearing in mind that both records might not have the same data in all of the fields? – Joel Coehoorn Dec 07 '15 at 21:29
  • This will be run daily. To use the loops I have the outer loop set to cycle through the array of customers, each time grabbing the current customer email, then an inner loop to run back through all of the customers checking the email addresses, if they are found to have a duplicate, I set a boolean field on the matching customer to mark for deletion. I don't actually delete the customer entry, when I write to the file I check to see if this boolean is true or not first to determine if I should write or not. – AWooster Dec 07 '15 at 21:59
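
The comments above point out that Distinct only works on a custom type if it is given an equality comparer. A minimal sketch of that idea follows. The Customer class is a stand-in based on the question, and CustomerEmailComparer and CustomerDeduper are hypothetical names, not something from the original post. Distinct yields the first occurrence of each email in practice, although the documentation describes the result as unordered.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Customer class described in the question.
public class Customer
{
    public string firstName { get; set; }
    public string lastName { get; set; }
    public string email { get; set; }
}

// Treats two customers as equal when their email strings match exactly.
public sealed class CustomerEmailComparer : IEqualityComparer<Customer>
{
    public bool Equals(Customer x, Customer y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null)) return false;
        return string.Equals(x.email, y.email, StringComparison.Ordinal);
    }

    public int GetHashCode(Customer c)
    {
        // Customers with a null email all hash to 0, so they count as duplicates of each other.
        if (c == null || c.email == null) return 0;
        return StringComparer.Ordinal.GetHashCode(c.email);
    }
}

public static class CustomerDeduper
{
    // Returns one customer per distinct email address.
    public static Customer[] DistinctByEmail(Customer[] customers)
    {
        return customers.Distinct(new CustomerEmailComparer()).ToArray();
    }
}
```

Calling CustomerDeduper.DistinctByEmail(customers) returns a new array and leaves the other properties of each kept customer untouched, because whole objects are compared and kept rather than individual strings.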

2 Answers

9

With Linq, you can do this in O(n) time (a single-level loop) with a GroupBy:

var uniquePersons = persons.GroupBy(p => p.Email)
                           .Select(grp => grp.First())
                           .ToArray();

Update

A bit on O(n) behavior of GroupBy.

GroupBy is implemented in Linq (Enumerable.cs) as follows:

The IEnumerable is iterated only once to create the grouping. A Hash of the key provided (e.g. "Email" here) is used to find unique keys, and the elements are added in the Grouping corresponding to the keys.

Please see the GetGrouping code, and some older posts, for reference.

Each hash lookup takes O(1) time on average, so building the groups is O(n). The Select that follows is also O(n), making the above code O(n) overall.
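
To make the single-pass behavior concrete, here is a rough hand-written equivalent of GroupBy followed by First. It is only a sketch and not the actual Enumerable.cs code. The Person class and the GroupBySketch.FirstPerEmail helper are assumed names for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the element type used in the answer.
public class Person
{
    public string Name { get; set; }
    public string Email { get; set; }
}

public static class GroupBySketch
{
    // Keeps the first person seen for each Email value, in input order.
    public static Person[] FirstPerEmail(IEnumerable<Person> persons)
    {
        var seen = new HashSet<string>();
        var result = new List<Person>();

        // The input is walked exactly once.
        foreach (var p in persons)
        {
            // HashSet.Add returns false if the key is already present,
            // and the check costs O(1) on average.
            if (seen.Add(p.Email))
            {
                result.Add(p);
            }
        }
        return result.ToArray();
    }
}
```

Like the simple GroupBy, this treats null as one key and the empty string as another, so at most one person with each is kept.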

Update 2

To handle empty/null values.

If some objects have a null or empty Email, the simple GroupBy treats null as one key and the empty string as another, so it keeps just one object for each of the two.

One quick way to keep all of the objects with a null/empty value is to generate a unique key at run time for each of them, like this:

var tempEmailIndex = 0;
var uniqueNullAndEmpty = persons
                         .GroupBy(p => string.IsNullOrEmpty(p.Email) 
                                       ? (++tempEmailIndex).ToString() : p.Email)
                         .Select(grp => grp.First())
                         .ToArray();
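
The counter-based key above relies on a side effect inside the key selector, and a generated key such as "1" could in theory clash with a real Email value of "1". As an alternative sketch, the objects without an email can be set aside before grouping and appended afterwards. The Person class and the BlankEmailSketch.DedupeKeepingBlanks helper are assumed names for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the element type used in the answer.
public class Person
{
    public string Name { get; set; }
    public string Email { get; set; }
}

public static class BlankEmailSketch
{
    // Deduplicates by Email but keeps every person whose Email is null or empty.
    public static Person[] DedupeKeepingBlanks(IEnumerable<Person> persons)
    {
        // Materialize once so the source is not enumerated twice.
        var list = persons.ToList();

        var withEmail = list.Where(p => !string.IsNullOrEmpty(p.Email))
                            .GroupBy(p => p.Email)
                            .Select(grp => grp.First());

        var withoutEmail = list.Where(p => string.IsNullOrEmpty(p.Email));

        // Note: people without an email end up after the deduplicated ones.
        return withEmail.Concat(withoutEmail).ToArray();
    }
}
```

The trade-off is that the original ordering is not preserved for the blank entries. If order matters, the counter approach above or a plain loop may be a better fit.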
Arghya C
  • *"As Linq is using Reflection"* - do you have a reference for this? – Arghya C Dec 07 '15 at 20:48
  • @Shnugo - LINQ doesn't use reflection. – Enigmativity Dec 07 '15 at 21:00
  • Can you please tell us how you know that this will execute in O(n) ? – Mike Nakis Dec 07 '15 at 21:22
  • This is working extremely well, but for some reason it is not writing anything to the csv file now. I see that after deleting the duplicates it returns the new array of customers and has data in it, and I have not changed any of the code that writes to the file... – AWooster Dec 07 '15 at 22:08
  • nvm, had to pass in the count of the customer records when creating a new Customer array. – AWooster Dec 07 '15 at 22:12
  • @AWooster seems like you have solved the csv creation problem, good to know it helped :) – Arghya C Dec 08 '15 at 05:12
  • @MikeNakis - It's `O(n)` because it only has to iterate the list once to build the groups. – Enigmativity Dec 08 '15 at 13:17
  • @Enigmativity this is not really an explanation, because in building the groups it is going to have to use a target data structure to add the groups to, so the running time depends on the time complexity of a) checking for existence and b) adding to that data structure. I have thought about it, and the explanation is that the data structure that will be used will probably be a hash set, which is O(1), so the total combined time is O(2N), which is O(N). What bugs me is that Arghya did not bother to answer. – Mike Nakis Dec 08 '15 at 13:29
  • @MikeNakis please see I had updated the answer with explanation and references. – Arghya C Dec 08 '15 at 14:23
  • @ArghyaC In my testing I have found (obviously) that this is removing customers that do not have an email address associated. Is there an easy way to exclude them? If not, I will assign a random number to the customer's email address field when there is nothing present. Thanks. – AWooster Dec 08 '15 at 21:51
  • @AWooster see if the **update 2** solves your problem? – Arghya C Dec 09 '15 at 04:32
0

I'd do it like this:

public class Person {
    public Person(string eMail, string Name) {
        this.eMail = eMail;
        this.Name = Name;
    }
    public string eMail { get; set; }
    public string Name { get; set; }
}
public class eMailKeyedCollection : System.Collections.ObjectModel.KeyedCollection<string, Person> {
    protected override string GetKeyForItem(Person item) {
        return item.eMail;
    }
}

public void testIt() {
    var testArr = new Person[5];
    testArr[0] = new Person("Jon@Mullen.com", "Jon Mullen");
    testArr[1] = new Person("Jane@Cullen.com", "Jane Cullen");
    testArr[2] = new Person("Jon@Cullen.com", "Jon Cullen");
    testArr[3] = new Person("John@Mullen.com", "John Mullen");
    testArr[4] = new Person("Jon@Mullen.com", "Test Other"); //same eMail as index 0...

    var targetList = new eMailKeyedCollection();
    foreach (var p in testArr) {
        if (!targetList.Contains(p.eMail))
            targetList.Add(p);
    }
}

If the item is already in the collection, you can easily retrieve it (and modify it if needed) with:

        if (!targetList.Contains(p.eMail))
            targetList.Add(p);
        else {
            var currentPerson = targetList[p.eMail];
            // modify Name, Address, whatever...
        }
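
As one possible way to fill in the else branch, the sketch below keeps the first record for each eMail and copies a Name over from a later duplicate only when the kept record has none. The merge rule and the MergeSketch.MergeByEmail helper are assumptions for illustration. Which record should win is up to you. The sketch also assumes every eMail is non-null, since KeyedCollection.Contains throws on a null key.

```csharp
using System.Collections.ObjectModel;

public class Person
{
    public Person(string eMail, string name)
    {
        this.eMail = eMail;
        this.Name = name;
    }
    public string eMail { get; set; }
    public string Name { get; set; }
}

public class eMailKeyedCollection : KeyedCollection<string, Person>
{
    protected override string GetKeyForItem(Person item)
    {
        return item.eMail;
    }
}

public static class MergeSketch
{
    // Hypothetical merge rule: keep the first record, fill a blank Name from duplicates.
    public static eMailKeyedCollection MergeByEmail(Person[] people)
    {
        var targetList = new eMailKeyedCollection();
        foreach (var p in people)
        {
            if (!targetList.Contains(p.eMail))
            {
                targetList.Add(p);
            }
            else
            {
                // Look up the record already stored under this eMail.
                var currentPerson = targetList[p.eMail];
                if (string.IsNullOrEmpty(currentPerson.Name))
                {
                    currentPerson.Name = p.Name;
                }
            }
        }
        return targetList;
    }
}
```

Changing Name on a stored item is safe here because only eMail is used as the key. Changing eMail on an item that is already in the collection would leave the internal lookup out of date.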
Shnugo