
I have data retrieved from one SQL table, and the value that will be my PK in the table I need to import into can come from one of two fields.

This means I can get duplicates, so obviously I need to remove them before ingesting into the destination table.

I also need to mark the duplicate records that I have not ingested as processed so they don't get picked up again, so I collect the "ID" value (which is unique) from the source table for each duplicate row into an array. The code below currently works; I'm just worried it might be slow on large datasets (I'm expecting around 50-60K rows at most).

    If ds.Tables(0).Rows.Count > 0 Then
        Dim DupeID_List As String() = Nothing
        Dim DupeID_Count As Integer = 0
        Dim hTable As New Hashtable()
        Dim duplicateList As New ArrayList()
        For Each drow__1 As DataRow In ds.Tables(0).Rows
            If hTable.Contains(drow__1("EIBItemID")) Then
                ReDim Preserve DupeID_List(DupeID_Count)
                duplicateList.Add(drow__1)
                DupeID_List(DupeID_Count) = CStr(drow__1("ID"))
                DupeID_Count += 1
            Else
                hTable.Add(drow__1("EIBItemID"), String.Empty)
            End If
        Next
        For Each dRow__2 As DataRow In duplicateList
            ds.Tables(0).Rows.Remove(dRow__2)
        Next
        If DupeID_List IsNot Nothing Then
            MarkDupeRecordsExported(DupeID_List)
        End If
        Return ds
    Else
        Return Nothing
    End If

Is there a better way of achieving the same goal: check for duplicate "EIBItemID" values and remove them from the dataset that will be bulk-copied to the destination table, while updating the source table for each record (ID) that is removed?

Colster
    Yes, it looks like there is room for a fair amount of improvement. Could you clarify the 2 steps/criteria though? Not for nothing, but you have asked 5 previous questions and gotten 5 answers, but none of them have been accepted. Accepting answers helps others find good answers and shows that you participate in the process. Please take the (brief) [Tour] to see how SO works. – Ňɏssa Pøngjǣrdenlarp Dec 21 '16 at 18:37
    [CodeReview](http://codereview.stackexchange.com/help/how-to-ask) may be a better fit for this question. – Gert Arnold Dec 21 '16 at 20:06
  • Sorry, you are correct; I have marked them all as answered now. The source table has a PK called ID and a value that I derive from part of potentially two other values: CAST(CASE WHEN IsNumeric(SUBSTRING(Field_3,10,8)) = 0 THEN SUBSTRING(Optional_20,10,8) ELSE SUBSTRING(Field_3,10,8) END As Int) As EIBItemID. This value may have duplicates, but EIBItemID is the PK in the destination table. I need to take only one record for each EIBItemID, but I also need to update the source table (ExportStatus = 'Y'); otherwise the process will see the duplicates next time and try to import them, which would error. – Colster Dec 22 '16 at 08:10

3 Answers


Use LINQ with Distinct.

Sample:

Dim nonDublicatesTable = From row In ds.Tables(0).AsEnumerable()
                         Select row.Field(Of String)("uniquefield") Distinct

Then truncate ds.Tables(0) and put the nonDublicatesTable data back into ds.Tables(0).

Or you can look at the answers to this topic: Delete all Duplicate Rows except for One in MySQL?
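Note that Distinct on a single selected field keeps only that column, not the whole row. A sketch that keeps one full row per key might look like this (assuming EIBItemID is an Integer, as in the CAST ... As Int expression in the comments; GroupBy/First keeps the first row of each group):

```vbnet
'Sketch only: group by EIBItemID, keep the first row per group,
'then replace the original table with the de-duplicated copy.
'Requires Imports System.Data and System.Linq (DataTableExtensions).
Dim nonDuplicates As DataTable = ds.Tables(0).AsEnumerable() _
    .GroupBy(Function(row) row.Field(Of Integer)("EIBItemID")) _
    .Select(Function(grp) grp.First()) _
    .CopyToDataTable()
ds.Tables.RemoveAt(0)
ds.Tables.Add(nonDuplicates)
```

CopyToDataTable throws on an empty sequence, so this belongs inside the existing `Rows.Count > 0` check.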

Community
  • Thanks for that but that doesn't allow me to capture the "ID" of the rows deleted. I need to update the source table so these records don't just get pulled again the next time the process runs – Colster Dec 22 '16 at 08:11

You can try using a HashSet while looping over the rows of your DataTable. For each row, add the Id to the HashSet and check whether it was actually added by counting the elements stored in it. If the HashSet count remains the same after trying to add an Id, the Id you tried to add is already in the HashSet. Knowing this, you can add the row to a new DataTable with the same fields (cloned). At the end of the loop you'll have all rows with duplicated Ids in the new DataTable.

'Use a HashSet to store Ids
Dim IdsWithoutDup As New HashSet(Of Integer)() 'Use the correct type for your Id
'Use a DataTable to store duplicated rows
Dim DupRows As DataTable = ds.Tables(0).Clone()

For Each drow__1 As DataRow In ds.Tables(0).Rows
    Dim PreCount = IdsWithoutDup.Count
    IdsWithoutDup.Add(CInt(drow__1("EIBItemID")))
    'If the Id wasn't added to the HashSet (because it's already in)
    'then add the row to the cloned table
    If IdsWithoutDup.Count = PreCount Then
        DupRows.Rows.Add(drow__1.ItemArray)
    End If
    '... do other stuff you need
Next
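As a side note, HashSet(Of T).Add returns a Boolean (False when the value is already present), so the PreCount comparison can be folded into the If. A sketch under the same assumptions (EIBItemID as Integer):

```vbnet
Dim ids As New HashSet(Of Integer)()
Dim dupRows As DataTable = ds.Tables(0).Clone()
For Each row As DataRow In ds.Tables(0).Rows
    'Add returns False when the Id is already in the set
    If Not ids.Add(CInt(row("EIBItemID"))) Then
        'Duplicate: copy the row's values into the cloned table
        dupRows.Rows.Add(row.ItemArray)
    End If
Next
```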
genespos

I took some elements from the answers provided to cut it down to a single loop:

    If ds.Tables(0).Rows.Count > 0 Then
        Dim NonDupesDT As DataTable = ds.Tables(0).Clone()
        Dim DupeID_List As String() = Nothing
        Dim DupeID_Count As Integer = 0
        Dim hTable As New Hashtable()
        For Each drow__1 As DataRow In ds.Tables(0).Rows
            If hTable.Contains(drow__1("EIBItemID")) Then
                ReDim Preserve DupeID_List(DupeID_Count)
                DupeID_List(DupeID_Count) = CStr(drow__1("ID"))
                DupeID_Count += 1
            Else
                hTable.Add(drow__1("EIBItemID"), String.Empty)
                NonDupesDT.Rows.Add(drow__1.ItemArray)
            End If
        Next
        If DupeID_List IsNot Nothing Then
            MarkDupeRecordsExported(DupeID_List)
        End If
        Return NonDupesDT
    Else
        Return Nothing
    End If

This way I can build a list of the "ID" values of the duplicates, and in the Else branch add rows to a new table containing only one version of each duplicate "EIBItemID".

I can then pass the new table back with the duplicates effectively removed, along with the array used to mark the duplicate IDs as processed so they won't get pulled on the next run.
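For what it's worth, the ReDim Preserve array can also be swapped for a List(Of String), which avoids reallocating the array on every duplicate. A sketch of the same loop (assuming MarkDupeRecordsExported accepts a String array, hence the ToArray call, and EIBItemID is an Integer):

```vbnet
Dim nonDupes As DataTable = ds.Tables(0).Clone()
Dim dupeIds As New List(Of String)()
Dim seen As New HashSet(Of Integer)()
For Each row As DataRow In ds.Tables(0).Rows
    'Add returns True only for the first occurrence of each EIBItemID
    If seen.Add(CInt(row("EIBItemID"))) Then
        nonDupes.Rows.Add(row.ItemArray)
    Else
        dupeIds.Add(CStr(row("ID")))
    End If
Next
If dupeIds.Count > 0 Then
    MarkDupeRecordsExported(dupeIds.ToArray())
End If
```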

Colster