I would like to optimize this C# code, if possible.
With fewer than 1,000 rows it runs fine, but at 10,000 rows or more it starts to take a noticeable amount of time. Here is a small benchmark:
- 5,000 rows => ~2 s
- 15,000 rows => ~20 s
- 25,000 rows => ~50 s
I'm looking for duplicated rows. The times above grow roughly with the square of the row count (3x the rows takes ~10x as long, 5x takes ~25x), which makes me suspect quadratic behavior.
I suspect the SequenceEqual calls used to compare key values are the problem (in my benchmark, 4 fields are used as key fields).
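As far as I know, Dictionary<List<object>, int> compares List<object> keys by reference, which is why I scan the stored keys manually instead of calling ContainsKey. A minimal standalone check (variable names are mine):

var a = new List<object> { 1, "x" };
var b = new List<object> { 1, "x" };
var dict = new Dictionary<List<object>, int>();
dict.Add(a, 0);
Console.WriteLine(dict.ContainsKey(b)); // False: same values, but different list references

So every row ends up being compared against every stored key with SequenceEqual, which I assume is what makes the whole method quadratic.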
Here is the code:
private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
    Dictionary<List<object>, int> keys = new Dictionary<List<object>, int>(); // Key values mapped to their row index in the table
    List<List<object>> duplicatedKeys = new List<List<object>>();             // Keys already reported as duplicated
    List<DataRow> duplicatedRows = new List<DataRow>();                       // Rows that are duplicated

    foreach (DataRow row in table.Rows)
    {
        // Collect the key field values for this row
        List<object> rowKeys = new List<object>();
        keyFields.ForEach(keyField => rowKeys.Add(row[keyField]));

        // Check whether this key has already been seen
        bool alreadyDefined = false;
        foreach (List<object> keyValue in keys.Keys)
        {
            if (rowKeys.SequenceEqual(keyValue))
            {
                alreadyDefined = true;
                break;
            }
        }

        if (alreadyDefined)
        {
            duplicatedRows.Add(row);
            // If this is the first duplicate for this key, also add its first occurrence.
            // (List.Contains would compare by reference, so SequenceEqual is needed here too.)
            if (!duplicatedKeys.Any(key => key.SequenceEqual(rowKeys)))
            {
                duplicatedKeys.Add(rowKeys);
                int i = keys[keys.Keys.First(key => key.SequenceEqual(rowKeys))];
                duplicatedRows.Add(table.Rows[i]);
            }
        }
        else
        {
            keys.Add(rowKeys, table.Rows.IndexOf(row));
        }
    }
    return duplicatedRows;
}
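One direction I was considering, as a rough sketch rather than tested code: give the dictionary a custom IEqualityComparer that hashes and compares the composite key by value, so each lookup is a hash lookup instead of a scan over every stored key, and store the first DataRow directly to avoid the O(n) Rows.IndexOf call. KeyComparer and the variable names are mine, and both pieces would sit in the same class as the original method:

using System.Collections.Generic;
using System.Data;
using System.Linq;

// Compares composite keys by value so the dictionary can hash them directly
sealed class KeyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y) => x.SequenceEqual(y);

    public int GetHashCode(object[] key)
    {
        unchecked
        {
            int hash = 17;
            foreach (object value in key)
                hash = hash * 31 + (value?.GetHashCode() ?? 0);
            return hash;
        }
    }
}

private List<DataRow> GetDuplicateKeys(DataTable table, List<string> keyFields)
{
    var comparer = new KeyComparer();
    var firstRowByKey = new Dictionary<object[], DataRow>(comparer); // first row seen for each key
    var reportedKeys = new HashSet<object[]>(comparer);              // keys already reported as duplicated
    var duplicatedRows = new List<DataRow>();

    foreach (DataRow row in table.Rows)
    {
        object[] rowKeys = keyFields.Select(field => row[field]).ToArray();

        if (firstRowByKey.TryGetValue(rowKeys, out DataRow firstRow)) // hashed lookup, no SequenceEqual scan
        {
            duplicatedRows.Add(row);
            // On the first duplicate for this key, also report the original occurrence
            if (reportedKeys.Add(rowKeys))
                duplicatedRows.Add(firstRow);
        }
        else
        {
            firstRowByKey.Add(rowKeys, row); // keeping the row itself avoids Rows.IndexOf
        }
    }
    return duplicatedRows;
}

If that is sound, the single dictionary lookup would replace both the SequenceEqual scan and the IndexOf call.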
Any ideas?