C#.NET LinQ remove Dupes from list with a twist

Question

I have a list of Company objects with name, address, email, etc. I need a way to sort the list by company name (which I can easily do). But then I need to compare the email and city. If one of those matches a company with the same name, I need to remove BOTH of them.

I can easily do the .First to get one and remove the matching one but I need the original to be removed from the list as well. Make sense? The bottom 2 LINQ statements remove the duplicate but not the original

for (int i = 2; i <= 10749; i++) //Range.UsedRange isnt working for some reason so I hard coded the row numbers
{
    firm = new Firm();
    string begCell;
    begCell = "B" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.acctNo = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "C" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.name = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "D" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.addy1 = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "E" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.city = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "F" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.state = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "G" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.zip = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "H" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.phone = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "I" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    firm.email = fixNullValues(Convert.ToString(excelCell.Value2));

    begCell = "AB" + i;
    excelCell = (Microsoft.Office.Interop.Excel.Range)excelWorksheet.get_Range(begCell, begCell);
    if (Convert.ToString(excelCell.Value2).Equals("Active", StringComparison.OrdinalIgnoreCase))
        firm.active = true;

    if (firm.active)
        firmList.Add(firm);

}

var sortedList = firmList.OrderBy(o => o.name).ToList();
var removeDupEmails = firmList.GroupBy(x => x.name).SelectMany(x =>
                             x.OrderByDescending(y => y.email)
                             .Take(1)
                         );
var removeDupZips = firmList.GroupBy(x => x.name).SelectMany(x =>
                             x.OrderByDescending(y => y.addy)
                             .Take(1)
                         );

You can `GroupBy` on multiple columns using anonymous objects. http://stackoverflow.com/questions/847066/group-by-multiple-columns — Shelby115, Apr 19 '16 at 16:55
Note that there's ambiguity in this comparison. Suppose two companies have the same name, but different Email and City. So they're not duplicates. Now there is a third company having the same name, and Email of company1 and City of company2. Does that make three duplicates, even though company1 and company2 aren't duplicates? — Gert Arnold, Apr 19 '16 at 20:20
if they have the (same name and addy) OR (same name and email) OR (same addy and email) then they are considered duplicates. I was unclear on that so it is my fault. I apoligize — Dan H, Apr 19 '16 at 21:49
OK, let's have company c1: A,B,C (name, email,city), c2: A,D,E. Not equal right? Add c3: A,B,E. Now we have c1 = c3 and c2 = c3, but c1 != c2. It doesn't compute. — Gert Arnold, Apr 20 '16 at 06:40

jdweng · Answer 1 · 2016-04-19T18:24:00.563

I think this will work. Instead of removing only add the ones that meet the criteria. I updated the code. Found some errors.

    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
            List<Company> companies = new List<Company>();
            var groups = companies.AsEnumerable().GroupBy(x => x.name)
                .Select(x => x.Where(y => (x.Select(z => z.email).ToList().Where(s => s == y.email).Count() == 1) && (x.Select(z => z.addy).ToList().Where(s => s == y.addy).Count() == 1)).Select(a => a)).SelectMany(b => b).ToList();


        }

    }
    public class Company
    {
        public string name {get;set;}
        public string email {get;set;}
        public string addy {get;set;}
    }

Thanks and forgive my ignorance but the code isnt able to run. the part that is giving me issues is: y => x.SelectMany(z => !z.email.Contains(y.email)). The error is Error 5 The type arguments for method 'System.Linq.Enumerable.SelectMany(System.Collections.Generic.IEnumerable, System.Func>)' cannot be inferred from the usage. Try specifying the type arguments explicitly. — Dan H, Apr 19 '16 at 17:52

score 0 · Answer 2 · answered Apr 19 '16 at 18:23

To eliminate duplicates, you can use the following pattern:

source.GroupBy(element => element.Property) // Split the list in groups by the property you want to check
.Where(propertyGroup => propertyGroup.Count() > 1) // Filter the duplicates
.SelectMany(propertyGroup => propertyGroup) // Flatten back the result set

If I understand correctly the concrete question, you can achieve the desired result with a single query like this:

var resultList = firmList
    .GroupBy(firm => firm.name)
    .OrderBy(nameGroup => nameGroup.Key)
    .SelectMany(nameGroup => nameGroup
        .GroupBy(firm => firm.email)
        .Where(emailGroup => emailGroup.Count() > 1)
        .SelectMany(emailGroup => emailGroup)
        .GroupBy(firm => firm.addy)
        .Where(addyGroup => addyGroup.Count() > 1)
        .SelectMany(addyGroup => addyGroup)
    )
    .ToList();

C#.NET LinQ remove Dupes from list with a twist

2 Answers2