I have a SQL Server table with three columns:
Table1
col1 int
col2 int
col3 varchar
I have a unique constraint defined across all three columns (col1, col2, col3).
Now I have a .csv file from which I want to add records to this table, and the .csv file may contain duplicate records.
I have looked at various options for avoiding duplicates in this scenario. Below are three options that work for me. Please have a look and share the pros and cons of each method so I can choose the best one.
Option #1:
Avoid duplicates in the first place, i.e. while adding objects to the list from the .csv file. I used a HashSet<T> for this and overrode the following methods on the record type T:
public override int GetHashCode()
{
    return col1.GetHashCode() + col2.GetHashCode() + col3.GetHashCode();
}

public override bool Equals(object obj)
{
    var other = obj as T;
    if (other == null)
    {
        return false;
    }
    return col1 == other.col1
        && col2 == other.col2
        && col3 == other.col3;
}
Option #2:
Use a List<T> instead of a HashSet<T>, and remove duplicates after all the objects have been added to the list:
List<T> distinctObjects = allObjects
    .GroupBy(x => new { x.col1, x.col2, x.col3 })
    .Select(g => g.First())
    .ToList();
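A variant of Option #2 worth comparing (my suggestion, not from the original) is Enumerable.Distinct with a custom IEqualityComparer<T>, which keeps the comparison logic out of the record type; CsvRecord and the property names are again illustrative stand-ins:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type; names are illustrative.
public sealed class CsvRecord
{
    public int Col1 { get; set; }
    public int Col2 { get; set; }
    public string Col3 { get; set; }
}

// Comparer defining equality over the three key columns.
public sealed class CsvRecordComparer : IEqualityComparer<CsvRecord>
{
    public bool Equals(CsvRecord x, CsvRecord y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Col1 == y.Col1
            && x.Col2 == y.Col2
            && x.Col3 == y.Col3;
    }

    public int GetHashCode(CsvRecord r)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + r.Col1;
            hash = hash * 31 + r.Col2;
            hash = hash * 31 + (r.Col3 != null ? r.Col3.GetHashCode() : 0);
            return hash;
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var allObjects = new List<CsvRecord>
        {
            new CsvRecord { Col1 = 1, Col2 = 2, Col3 = "a" },
            new CsvRecord { Col1 = 1, Col2 = 2, Col3 = "a" }, // duplicate
        };
        var distinctObjects = allObjects.Distinct(new CsvRecordComparer()).ToList();
        Console.WriteLine(distinctObjects.Count); // prints 1
    }
}
```

This keeps T free of Equals/GetHashCode overrides, so the same type can use different equality rules in different contexts.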
Option #3:
Remove duplicates after all the objects have been added to a DataTable:
public static DataTable RemoveDuplicatesRows(DataTable dataTable)
{
    IEnumerable<DataRow> uniqueRows = dataTable.AsEnumerable().Distinct(DataRowComparer.Default);
    DataTable dataTable2 = uniqueRows.CopyToDataTable();
    return dataTable2;
}
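For completeness, a runnable sketch of Option #3: DataRowComparer.Default compares rows column-by-column by value, not by reference, so two separately added rows with identical values count as duplicates. The table and column names follow the question; the sample rows are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class Demo
{
    public static DataTable RemoveDuplicatesRows(DataTable dataTable)
    {
        // Distinct with DataRowComparer.Default keeps one row per
        // unique combination of column values.
        IEnumerable<DataRow> uniqueRows =
            dataTable.AsEnumerable().Distinct(DataRowComparer.Default);
        return uniqueRows.CopyToDataTable();
    }

    public static void Main()
    {
        var table = new DataTable("Table1");
        table.Columns.Add("col1", typeof(int));
        table.Columns.Add("col2", typeof(int));
        table.Columns.Add("col3", typeof(string));
        table.Rows.Add(1, 2, "a");
        table.Rows.Add(1, 2, "a"); // duplicate row by value

        Console.WriteLine(RemoveDuplicatesRows(table).Rows.Count); // prints 1
    }
}
```

Note that CopyToDataTable throws InvalidOperationException if the input sequence is empty, so an empty CSV needs a guard before calling this.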
Although I have not compared their running times, I prefer Option #1 because it removes duplicates as a first step, so only the required records move forward.
Please share your views so I can choose the best one.
Thanks a lot!