I've parsed a grid of Excel data and built up an object model. There are 32 columns and 100,000 rows. I've been asked to check for rows with duplicate data and report them. For my implementation I'm doing the following:

  1. Using Tasks I'm building an array of Tuple<int, string> with the row id and the concatenated cell contents.
  2. I loop through the resultant array and, using a HashSet, try to insert each concatenated value into the HashSet.
  3. If HashSet.Add() succeeds, I create a new entry for it in my Dictionary<string, IList<int>> result set.
  4. If HashSet.Add() fails, I add that row id to the existing entry in my Dictionary<string, IList<int>> result set.

Step 1 takes 0.09s, while the rest is taking 822s to process :/ Can anyone spot where I can chop this time down with a more appropriate choice of collections or algorithms?

Code is below:

var results = new Dictionary<string, IList<int>>(numberOfRows);
var hashSet = new HashSet<string>();
var duplicateErrors = new List<string>();

for (var row = firstRow; row <= lastRow; row++)
{
    var row1 = row; // local copy so each task captures its own row value
    taskArray[count++] =
        Task<Tuple<int, string>>.Factory.StartNew(() => GetCompleteRowData(row1, tableRawDataHolders));
}

foreach (var task in taskArray)
{
    if (hashSet.Add(task.Result.Item2))
    {
        results.Add(task.Result.Item2, new List<int>() { task.Result.Item1 });
    }
    else
    {
        results[task.Result.Item2].Add(task.Result.Item1);
    }
}

and

public Tuple<int, string> GetCompleteRowData(int row, IEnumerable<ITableRawDataHolder> tableRawDataHolders)
{
    // Concatenate all cell values belonging to the given row
    return new Tuple<int, string>(row, string.Join("",
        tableRawDataHolders.Where(c => c.Row == row).Select(c => c.Value).ToArray()));
}

and

public class TableRawDataHolder : ITableRawDataHolder
{
    public int Row { get; }
    public int Column { get; }
    public string Value { get; }

    public TableRawDataHolder(int row, int column, string value)
    {
        Row = row;
        Column = column;
        Value = value;
    }
}
Neil Humby
  • It actually has nothing to do with `HashSet` vs `Dictionary` performance. The title looks similar, but read the question carefully. The OP asks about looking for duplicates among 200000 rows. – Yeldar Kurmangaliyev Aug 17 '16 at 08:59
  • 1
    *"Step 1 takes 0.09s, while the rest is taking 822s to process :/"*. It is actually not. It takes 0.09s to asynchronously start your tasks. However, when you try to access `task.Result`, it blocks the thread. – Yeldar Kurmangaliyev Aug 17 '16 at 09:00
  • I voted to reopen since @YeldarKurmangaliyev had some good arguments. Still, [this post](http://stackoverflow.com/q/2728500/993547) is useful. – Patrick Hofman Aug 17 '16 at 09:10
  • Thank you for the replies. I'm not looking specifically for HashSet vs Dictionary; I'm open to any suggestions that will improve the processing time. Yeldar, is there a way to enumerate the tasks with priority on those that have already completed? For example, I'd be blocking while waiting for Task 7 to complete while Tasks 11-18 could already have finished. That seems to negate the benefit of the parallel processing. – Neil Humby Aug 17 '16 at 11:24
  • I don't think you need to go to that level of granularity (1 record == 1 task) unless every single record takes milliseconds. You can use Parallel.ForEach on your whole input collection and let .NET split the workload automatically; you just need to take care that the output collection holding the loaded data is thread safe. Alternatively you can use the .AsParallel() method in a LINQ chain, which will also parallelize processing and take care of combining the output. – dlxeon Aug 17 '16 at 12:40
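
As a minimal sketch of the Parallel.For / AsParallel idea from the comment above (not code from the thread; it reuses GetCompleteRowData, firstRow, lastRow and tableRawDataHolders from the question and assumes a ConcurrentDictionary keeps the shared results thread safe):

// Requires: using System.Collections.Concurrent; using System.Linq; using System.Threading.Tasks;
var resultsByContent = new ConcurrentDictionary<string, ConcurrentBag<int>>();

// One parallel loop over the rows instead of one Task object per row.
Parallel.For(firstRow, lastRow + 1, row =>
{
    var rowData = GetCompleteRowData(row, tableRawDataHolders);

    resultsByContent
        .GetOrAdd(rowData.Item2, _ => new ConcurrentBag<int>())
        .Add(rowData.Item1);
});

// Any key that collected more than one row id is a duplicate.
var duplicateRows = resultsByContent.Where(kvp => kvp.Value.Count > 1);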

1 Answer

In this situation the issue is not Dictionary or HashSet performance.

The overhead comes from the way you read your data in GetCompleteRowData and from the way you work with tasks.

  • It seems you enumerate the full collection every time you need to convert the next record (see the sketch after this list).
  • For every record you create a task, which itself adds a small overhead, and when you access task.Result the calling thread just waits until that task has finished.
  • Also, it is not clear how fast your ITableRawDataHolder returns data.
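
As a rough sketch of avoiding that repeated enumeration (not the original code; it assumes the hashSet and results collections and the ITableRawDataHolder interface from the question), you can build a lookup of cells keyed by row once, so each row's concatenation becomes a simple lookup instead of a full scan:

// Group all cells by their row in a single pass over the data.
var cellsByRow = tableRawDataHolders.ToLookup(c => c.Row);

for (var row = firstRow; row <= lastRow; row++)
{
    // Concatenate this row's cell values without rescanning the whole collection.
    var key = string.Join("", cellsByRow[row].Select(c => c.Value));

    if (hashSet.Add(key))
    {
        results.Add(key, new List<int> { row });
    }
    else
    {
        results[key].Add(row);
    }
}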

To demonstrate pure HashSet/Dictionary performance I've created a test where I iterate through an array of already prepared tuples. This code takes just 32ms on my machine (i7 quad core).

const Int32 numberOfRows = 200000;
var inputData = GetInputData(numberOfRows);
var results = new Dictionary<string, IList<int>>(numberOfRows);
var hashSet = new HashSet<string>();

var sw = Stopwatch.StartNew();
foreach (var dataItem in inputData)
{
    if (hashSet.Add(dataItem.Item2))
    {
        results.Add(dataItem.Item2, new List<int>() {dataItem.Item1});
    }
    else
    {
        results[dataItem.Item2].Add(dataItem.Item1);
    }
}
Console.WriteLine(sw.ElapsedMilliseconds);

Here is how I generate the test data (it includes some actual duplicates):

private static List<Tuple<int, String>> GetInputData (int numberOfRows)
{
    var result = new List<Tuple<int, String>>(numberOfRows);
    var rnd = new Random();
    for (var i = 0; i < numberOfRows; i++)
    {
        // Roughly once in 100 records, reuse an earlier value so there are duplicates
        if (result.Count > 0 && rnd.Next(100) == 0)
        {
            result.Add(new Tuple<int, string>(i,  result[rnd.Next(result.Count)].Item2));
        }
        else
            result.Add(new Tuple<int, string>(i, Guid.NewGuid().ToString()));
    }
    return result;
}
dlxeon
  • Thank you so much for your example! It got me thinking to change my input and wrap it in an IDictionary with the row number as the key. I no longer need to query through all the data with LINQ for every row, and it's cut the processing time from 822 seconds to 22 seconds. Really appreciate your help on this. – Neil Humby Aug 18 '16 at 08:30
  • I'm glad that helped, although 22 seconds still looks pretty long. You said you get the data from Excel; it is possible you are reading that chunk of data in a way that can be optimized. For example, in some cases it may be faster to read a whole cell range at once into an array rather than reading cell by cell (a rough sketch follows below). – dlxeon Aug 18 '16 at 08:36
  • It's an entire sheet, 32 columns by 100,000 rows. I'll take a look at the code retrieving it for further bottlenecks. – Neil Humby Aug 19 '16 at 16:55
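
To illustrate that last comment: assuming the classic Microsoft.Office.Interop.Excel API (an assumption; the thread never names the Excel library in use), a whole range can be pulled into a single 2-D array via Range.Value2, roughly like this:

// Sketch only, assuming an already opened 'workbook' and
// using Excel = Microsoft.Office.Interop.Excel;
var sheet = (Excel.Worksheet)workbook.Worksheets[1];
var usedRange = sheet.UsedRange;

// Value2 returns the whole range as a 1-based object[,] in one call,
// which is usually much faster than reading cell by cell.
var values = (object[,])usedRange.Value2;

for (var row = 1; row <= values.GetLength(0); row++)
{
    for (var col = 1; col <= values.GetLength(1); col++)
    {
        var cellValue = values[row, col]?.ToString() ?? "";
        // ... build the ITableRawDataHolder entries from cellValue here ...
    }
}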