I've parsed a grid of Excel data and built up an object model. There are 32 columns and 100,000 rows. I've been asked to check for rows with duplicate data and report them. My implementation does the following:
- Using Tasks, I build an array of Tuple<int, string> holding each row id and that row's concatenated cell contents.
- I then loop through the resulting array and try to insert each concatenated value into a HashSet<string>:
  - If HashSet.Add() succeeds, I create a new entry for it in my Dictionary<string, IList<int>> result set.
  - If HashSet.Add() fails, I add that row id to the existing entry in the Dictionary<string, IList<int>> result set.
Step 1 takes 0.09s, while the rest takes 822s to process :/ Can anyone spot where I can chop this time down with a more appropriate choice of collections or algorithms?
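For reference, those two numbers come from a Stopwatch wrapped around each stage, roughly like this (just a sketch of the harness, not my exact code; the loops it refers to are in the snippets below):

var sw = System.Diagnostics.Stopwatch.StartNew();
// ... step 1: the for loop below that fills taskArray ...
Console.WriteLine($"Build tasks: {sw.Elapsed}");      // ~0.09s
sw.Restart();
// ... step 2: the foreach loop below that fills hashSet/results ...
Console.WriteLine($"Find duplicates: {sw.Elapsed}");  // ~822s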
Code is below:
var results = new Dictionary<string, IList<int>>(numberOfRows);
var hashSet = new HashSet<string>();
var duplicateErrors = new List<string>();
var taskArray = new Task<Tuple<int, string>>[lastRow - firstRow + 1];
var count = 0;

for (var row = firstRow; row <= lastRow; row++)
{
    var row1 = row; // capture the loop variable for the closure
    taskArray[count++] =
        Task<Tuple<int, string>>.Factory.StartNew(() => GetCompleteRowData(row1, tableRawDataHolders));
}
foreach (var task in taskArray)
{
    if (hashSet.Add(task.Result.Item2))
    {
        // first time we've seen this concatenated row value
        results.Add(task.Result.Item2, new List<int> { task.Result.Item1 });
    }
    else
    {
        // duplicate value: record this row id against the existing entry
        results[task.Result.Item2].Add(task.Result.Item1);
    }
}
and
public Tuple<int, string> GetCompleteRowData(int row, IEnumerable<ITableRawDataHolder> tableRawDataHolders)
{
    // Build the row's key: concatenate every cell value that belongs to this row
    return new Tuple<int, string>(row, string.Join("",
        tableRawDataHolders.Where(c => c.Row == row).Select(c => c.Value).ToArray()));
}
and
public class TableRawDataHolder : ITableRawDataHolder
{
    public int Row { get; }
    public int Column { get; }
    public string Value { get; }

    public TableRawDataHolder(int row, int column, string value)
    {
        Row = row;
        Column = column;
        Value = value;
    }
}
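For completeness, ITableRawDataHolder isn't shown above; it's just the three read-only properties that TableRawDataHolder implements (a sketch, since I've omitted the actual interface definition):

public interface ITableRawDataHolder
{
    int Row { get; }
    int Column { get; }
    string Value { get; }
}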