Removing duplicates from datatable

Question

I'm trying to remove duplicates in a datatable similar to this question. However, when I do this I need to do it on an ordered dataset, as one of the criteria is that time is one of my columns, and I need only the earliest time instance to remain.

I also came across this question on ordered lists from a datatable, but I'm not sure how to combine the two.

Basically, I'm reading a file into a dataset, then I want to sort on time and three other columns, and delete all duplicates leaving the earliest time instance. The columns in question are Name (int), phone number (long), time (int) and location (string). If the name, phone and location are duplicated, remove everything after the first (Earliest) time.

dsHoldingSet.Tables["FileData"].Columns.Add("location", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("name", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("time", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("phone", typeof(long));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(int));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(long));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(Boolean));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));
dsHoldingSet.Tables["FileData"].Columns.Add("field", typeof(string));

That's the table definition, then we add rows as we validate lines in the file.

You can do an `OrderBy` on your `DataTable` using LINQ, Can you add some related code with your question ? — Habib, Apr 23 '14 at 17:49
@Habib - I have no code. I have a dataset that I need to do operations on, and I've been reading to figure out how to do what I need. — JohnP, Apr 23 '14 at 17:53
it would be hard to reply to your question. You can show how your DataSet/DataTable is defined, what are the columns (Name and type). How you want to get distinct values ? are you looking to get distinct values based on a single column/multiple columns. How you like to order them *(based on which column)*. This will provide the community a better option to reply. — Habib, Apr 23 '14 at 17:55

Rex M · Accepted Answer · 2014-04-23T18:48:09.513

What we want to do is group the rows by the distinct values. If we want to use LINQ against a DataTable, the easiest way is using the built-in DataTable.AsEnumerable() extension method. This returns an IEnumerable<DataRow> for you.

Once we've got that, we need to construct a comparable object out of the composite of the three values. Here I used the approach of string concatenation, because strings are easy to compare. There are other ways you could do this, but this one is simple:

name|phone|location

This produces a sequence of IGrouping<string, DataRow>. Each grouping is also an IEnumerable<DataRow> which represents the subset for that group. So if we sort each grouping object by time, and pull the first one off, that's the first row.

Here's the complete code.

var rows = dsHoldingSet.Tables["FileData"].AsEnumerable()
    .GroupBy(row => string.Format("{0}|{1}|{2}",
        row.Field<string>("name"),
        row.Field<string>("phone"),
        row.Field<string>("location"))
    .Select(group => 
        group.OrderBy(row => row.Field<TimeSpan>("time")).First());

Some other notes - phone should be a string, not a long; unless time represents some other kind of measure you haven't gone into, it should either be a TimeSpan or a DateTime. The first thing you want to do when loading a data set to manipulate is coerce data into their most robust and correct data types - it makes the actual manipulation MUCH easier. You can deconvert back if you need to after it's done.

I'll work on that, first I have to figure out what you wrote. :) Learning curves can be steep sometimes. Does the above code reflect changing it to a timespan first, or is it written to consider it as a numerical? — JohnP, Apr 23 '14 at 18:22
@JohnP just replace the `Field` with whatever the actual value type is. — Rex M, Apr 23 '14 at 18:24

Removing duplicates from datatable

1 Answers1