2

I have a datatable the has duplicate lines. I need to get the duplicates and compare the duplicate lines for the best value in certain columns.

DataTable dt = new DataTable();

dt.Rows.Add(1, "Test1", "584", 12);
dt.Rows.Add(2, "Test2", "32", 123);
dt.Rows.Add(3, "Test3", "425", 54);
dt.Rows.Add(4, "Test1", "4", 755);
dt.Rows.Add(5, "Test5", "854", 879);
dt.Rows.Add(6, "Test2", "1", null);
dt.Rows.Add(7, "Test2", "999", 3);

Notice Test 1 and 2 have duplicates.

(1, "Test1", "584", 12)
(4, "Test1", "4", 755)

(2, "Test2", "32", 123)
(6, "Test2", "1", null)
(7, "Test2", "999", 3)

Now that I have the duplicates. I need to make one line that has the best values. New datatable should show:

Test1 = "Test1", "584", 755
Test2 = "Test2", "999", 123
Test3 = "Test3", "425", 54
Test5 = "Test5", "854", 879
Rand Random
  • 7,300
  • 10
  • 40
  • 88
54 69 6d
  • 167
  • 16

3 Answers3

1
//GroupBy(x => x[1]) = groupby the second column
//Where(x => x.Count() > 1) = only get groups that have a count greater than 1, so duplicates
var duplicates = dt.Rows.OfType<DataRow>().GroupBy(x => x[1]).Where(x => x.Count() > 1).ToList();

//enumerate all duplicates
foreach (var duplicate in duplicates)
{
    //enumerate each row of the duplicate
    foreach (var dataRow in duplicate)
    {
        //do something…
        //I don't know your rules why a row is better than the other, so that part you have to figure out yourself, or extend your question
    }
}

Maybe you are looking for this:

DataTable dt = new DataTable();
dt.Columns.Add("Id", typeof(int));
dt.Columns.Add("Text", typeof(string));
dt.Columns.Add("Value1", typeof(string));
dt.Columns.Add("Value2", typeof(int));

dt.Rows.Add(1, "Test1", "584", 12);
dt.Rows.Add(2, "Test2", "32", 123);
dt.Rows.Add(3, "Test3", "425", 54);
dt.Rows.Add(4, "Test1", "4", 755);
dt.Rows.Add(5, "Test5", "854", 879);
dt.Rows.Add(6, "Test2", "1", null);
dt.Rows.Add(7, "Test2", "999", 3);

var duplicates = dt.Rows.OfType<DataRow>().GroupBy(x => x[1]).Where(x => x.Count() > 1).ToList();

//get the current highestId (first column) so that when we remove duplicates and a new row the new row will get the next available id
var highestId = dt.Rows.OfType<DataRow>().Max(x => (int)x[0]);

//enumerate all duplicates
foreach (var duplicate in duplicates)
{
    //get the highest value of each column
    var newId = ++highestId;
    var newText = duplicate.Key;
    var newValue1 = duplicate.Max(x => x[2]); //this does a string comparison, instead of a numeric one, this means that for example that 2 is bigger then 10

    // use this if you need numeric comparison
    var newValue1AsNumeric = duplicate.Select(x =>
    {
        if (int.TryParse(Convert.ToString(x[2]), out var value))
            return value;

        return (int?)null;
    }).OfType<int>().Max(); 

    var newValue2 = duplicate.Select(x => x[3]).OfType<int>().Max();

    //enumerate each row of the duplicate
    foreach (var dataRow in duplicate)
        dt.Rows.Remove(dataRow);

    dt.Rows.Add(newId, newText, newValue1, newValue2);
}

You can see code in action here: https://dotnetfiddle.net/rp1DUc

Rand Random
  • 7,300
  • 10
  • 40
  • 88
  • Thanks. Unfortunately, I should have mentioned the version of .net. This project is using 4.0, was upgraded from 3.5. So ``` var value ``` isn't going to work – 54 69 6d Nov 13 '19 at 11:28
  • @54696d and where is your issue? I have no problem running this code with .net 4.0 - as you can see here https://i.stack.imgur.com/4hgje.png – Rand Random Nov 13 '19 at 11:32
  • `var value` is no .net Framework feature but a c# language feature which c# language version are you using and why are you sticking to this version? – Rand Random Nov 13 '19 at 11:35
  • It's telling me "Feature 'out variable declaration' is not available in C# 4 Please use language version 7.0 or greater – 54 69 6d Nov 13 '19 at 11:35
  • Okay, but why are you sticking to this old language version? – Rand Random Nov 13 '19 at 11:36
  • lol I don't want to. We are building a newer product, just not right at the moment. So unfortunately I need to make changes to this version. – 54 69 6d Nov 13 '19 at 11:38
  • I believe there is a confusion, about what .net framework version does to your application, and what the c# language version does, the c# language "only" enable new syntatic features - so the the compiler knows if you write `var xxx` what the hell you want to tell - only if you change the .net framework version you will for example limit certain operating systems eg. Windows xp not supporting .net Framework 4.8 - but you can develop a .net Framework 1.0 with the highest available c# language Version, so you should reconsider your choice – Rand Random Nov 13 '19 at 11:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202267/discussion-between-54-69-6d-and-rand-random). – 54 69 6d Nov 13 '19 at 11:44
  • @54696d - have a look at this: https://stackoverflow.com/questions/31548699 – Rand Random Nov 13 '19 at 12:45
0

Use DataTable.AsEnumerable() // LINQ Then use GroupBy(), // LINQ filter records, Process Them, create new DataTable / Remove non required records from same data table Done.

Shyam
  • 182
  • 1
  • 14
0

I named your columns to make things a bit easier:

DataTable dt = new DataTable();
dt.Columns.Add("id", Type.GetType("System.Int32"));
dt.Columns.Add("group", Type.GetType("System.String"));
dt.Columns.Add("first", Type.GetType("System.String"));
dt.Columns.Add("second", Type.GetType("System.Int32"));

dt.Rows.Add(1, "Test1", "584", 12);
dt.Rows.Add(2, "Test2", "32", 123);
dt.Rows.Add(3, "Test3", "425", 54);
dt.Rows.Add(4, "Test1", "4", 755);
dt.Rows.Add(5, "Test5", "854", 879);
dt.Rows.Add(6, "Test2", "1", null);
dt.Rows.Add(7, "Test2", "999", 3);

You can then group and find the maximums (assuming that is what you mean by 'best') using Linq:

var group = dt.AsEnumerable().GroupBy(row => row.Field<string>("group")).Select(g => new
{
    group = g.Key,
    first = g.Max(row => int.Parse(row.Field<string>("first"))).ToString(),
    second = g.Max(row => row.Field<int?>("second") ?? 0)
}).ToList();

This gives you a list that matches your desired output. I have assumed that a null value should be considered to be a value of 0. You can then put the values back into your original DataTable:

dt.Clear();
var rowCount = 1;
foreach (var x in group)
    dt.Rows.Add(rowCount++, x.group, x.first, x.second);

theduck
  • 2,589
  • 13
  • 17
  • 23