4

I have a HashSet containing custom objects generated from reading a binary file. I also have a dictionary generated from reading each row of a DBF file. There's an index property on both that line up with each other. For example, the 10th item in my Dictionary will line up with the 10th item in my HashSet.

I am comparing LARGE amounts of data against each other. There can be anywhere from 10,000 records to 500,000. The application checks the other two files (one binary, the other is a dbf) for differences. It checks the HashCode of the object (which is generated by certain properties, it does this comparison fast and easy)

Here is how I build each individual dictionary (there is a similar one for mod as well):

foreach (DataRow row in origDbfFile.datatable.Rows)
{
    string str = "";
    foreach (String columnName in columnNames)
    {
        str += "~" + row.Field<Object>(columnName);
    }
    origDRdict.Add(d, str);
    d++;
}

The columns between the two files will always be the same. However I can run into two different files with different columns. I essentially output all data into a string for dictionary lookup. I only want to hit the DBF file again if the data is different.

Here is my code for DB lookup. This will find differences, it's just really slow when it runs the ELSE section of my (!foundIt) if block. If I remove it, it only takes one minute to list all not found items.

foreach (CustomClass customclass in origCustomClassList) {
    Boolean foundIt = false;
    if (modCustomClassList.Contains(customclass))
    {
        foundIt = true;
    }
    //at this point, an element has not been found
    if (!foundIt)
    {
        notFoundRecords.Add(customclass);

    } 
    //If I remove this entire else block, code runs fast.
    else //at this point an element has been found
    {
        //
        //check 'modified' dictionary array
        if (!(modDRdict.ContainsValue(origDRdict[i])))
        {
            //at this point, the coordinates are the same, 
            //however there are DB changes
            //this is where I would do a full check based on indexes 
            //to show changes. 
        }
    }

    i++; //since hashsets can't be indexed, we need to increment
}

What I've tried / Other Thoughts

  • Generating a HashSet of custom objects, custom object having an index of an integer, and string being the length of columns and values

  • Removing if (!(modDRdict.ContainsValue(origDRdict[i]))) block makes code significantly quicker. Time to iterate removed records between two 440,000 record files only takes one minute. The dictionary lookup is taking forever!

  • I don't think the foreach loop within the foreach loop is causing too much overhead. If I keep it in the code, but don't do a lookup then it still runs quick.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
Evan Parsons
  • 1,139
  • 20
  • 31
  • 13
    Dictionaries are useful (i.e. fast) when you perform ContainsKey not ContainsValue... – digEmAll Jun 06 '13 at 14:32
  • 6
    To expand, `ContainsKey` is o(1) while `ContainsValue` is o(n). – Rotem Jun 06 '13 at 14:33
  • Instead of Contains, you can use TryGetValue. It is about 40% faster, as you can see in this [thread](http://stackoverflow.com/questions/9382681/what-is-more-efficient-dictionary-trygetvalue-or-containskeyitem) – Pierre-Luc Pineault Jun 06 '13 at 14:34
  • So using ContainsKey should make a significant speed increase? I will try this and TryGetValue. – Evan Parsons Jun 06 '13 at 14:36
  • Dictionaries and hash sets are unordered so it doesn't make any sense to talk about the 10th element of either of them. – Lee Jun 06 '13 at 14:36
  • Dictionaries don't know where they put their values; they only know where they put their keys. So what's faster? Searching for something whose position is known, or searching for something whose position is not? – Nolonar Jun 06 '13 at 14:38
  • @Pierre-LucPineault: TryGetValue looks up by key, not by value which is what the OP is asking about. – jason Jun 06 '13 at 14:38
  • @Evan Parsons: ContainsKey is for looking up by key, not by value. You seem confused as to which you need. – jason Jun 06 '13 at 14:40
  • @Jason Oh, you're right he's really looking up by value. Well, that is a weird idea. – Pierre-Luc Pineault Jun 06 '13 at 14:42
  • So if I make the string my key, and the value the integer (where it is in the DBF file) then lookup should be fast? I'm trying to output the current key into the .Contains method, however I'm unsure of how to do this, ie, (!(modDRdict.ContainsKey(origDRdict[i]))) – Evan Parsons Jun 06 '13 at 15:03
  • @EvanParsons: to me it's not entirely clear what are you trying to accomplish... could you post a small example showing what the data structures contain and what's the desired result ? – digEmAll Jun 06 '13 at 17:59

1 Answers1

3

Dictionaries are optimized to look up by key, not by value. If you need to look up by value, you're using the wrong dictionary. You'll need to build either a HashSet on your values to quickly check for containment, or build a reverse dictionary if you need the keys.

jason
  • 236,483
  • 35
  • 423
  • 525