
We have a hashtable (specifically the C# Dictionary class) that holds several thousand to several million (Key, Value) pairs for near-O(1) search hits/misses.

We'd like to be able to flush this data structure to disk (serialize it) and load it again later (deserialize) such that the internal hashtable of the Dictionary is preserved.

What we do right now:

  1. Load from Disk => List<KVEntity>. (KVEntity is serializable. We use Avro to serialize - can drop Avro if needed)
  2. Read every KVEntity from array => dictionary. This regenerates the dictionary/hashtable internal state.
  3. < System operates, Dictionary can grow/shrink/values change etc >
  4. When saving, read from the dictionary into an array (copying myKVDict.Values into a new List<KVEntity>)
  5. We serialize the array (List<KVEntity>) to disk to save the raw data (a sketch of this round-trip follows the pseudo code below)

Notice that during our save/restore, we lose the internal hashtable/dictionary state and have to rebuild it each time.

We'd like to directly serialize to/from the Dictionary (including its internal "live" state) instead of using an intermediate array just for the disk I/O. How can we do that?

Some pseudo code:

// The actual "node" that has information. Both myKey and myValue have actual data worth storing
public class KVEntity
{
    public string myKey {get;set;}
    public DataClass myValue {get;set;}
}

// unit of disk IO/serialization
public List<KVEntity> myKVList {get;set;} 

// unit of run time processing. The string key is KVEntity.myKey
public Dictionary<string,KVEntity> myKVDict {get;set;} 
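
For reference, here is a minimal sketch of that round-trip (steps 1-5 above). SerializeToDisk and DeserializeFromDisk are hypothetical stand-ins for whatever Avro wrapper is in use; everything else follows from the declarations above:

// A minimal sketch of the round-trip in steps 1-5 above.
// SerializeToDisk/DeserializeFromDisk are hypothetical stand-ins for the Avro layer.
// requires: using System.Collections.Generic; using System.Linq;

public Dictionary<string, KVEntity> LoadDict(string path)
{
    List<KVEntity> entries = DeserializeFromDisk(path);         // step 1: disk => List<KVEntity>
    var dict = new Dictionary<string, KVEntity>(entries.Count); // pre-size to avoid rehashing while filling
    foreach (KVEntity e in entries)                             // step 2: rebuild the hashtable state
        dict[e.myKey] = e;
    return dict;
}

public void SaveDict(Dictionary<string, KVEntity> dict, string path)
{
    List<KVEntity> entries = dict.Values.ToList();              // step 4: dictionary => new List<KVEntity>
    SerializeToDisk(entries, path);                             // step 5: serialize raw data to disk
}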
DeepSpace101
  • Possible duplicate http://stackoverflow.com/questions/495647/serialize-class-containing-dictionary-member – mockinterface Jan 04 '14 at 01:35
  • http://stackoverflow.com/questions/67959/net-xml-serialization-gotchas – Mitch Wheat Jan 04 '14 at 01:38
  • you should try this http://stackoverflow.com/questions/14436606/protobuf-net-object-reference-deserialization-using-dictionary-a-reference-trac – Fredou Jan 04 '14 at 01:38
  • Fastest speedup I ever got with hashtables was selecting a capacity that exceeds the expected size. The re-balancing of hashtables is very costly, so if it can be avoided, things go a lot quicker. – spender Jan 04 '14 at 02:34

2 Answers


Storing the internal state of the Dictionary instance would be bad practice: a key tenet of OOP is encapsulation, meaning that internal implementation details are deliberately hidden from the consumer.

Furthermore, the mapping algorithm used by Dictionary might change across different versions of the .NET Framework, especially given that CIL assemblies are designed to be forward-compatible (i.e. a program written against .NET 2.0 will generally work against .NET 4.5).

Finally, there are no real performance gains from serialising the internal state of the dictionary. It is much better to use a well-defined file format with a focus on maintainability than speed. Besides, if the dictionary contains "several thousands" of entries then it should load from disk in under 15ms by my reckoning (assuming you have an efficient on-disk format). Also, a data structure optimised for RAM will not necessarily work well on disk, where sequential reads/writes are better.

Your post is very adamant about working with the internal state of the dictionary, but your existing approach seems fine (albeit it could do with some optimisations). If you reveal more details, we can help you make it faster.

Optimisations

The main issue I see with your existing implementation is the conversion to/from arrays and lists, which is unnecessary given that Dictionary is directly enumerable.

I would do something like this:

Dictionary<String,TFoo> dict = ... // where TFoo : new() and implements arbitrary Serialize(BinaryWriter) and Deserialize(BinaryReader) methods

using(FileStream fs = File.Create("filename.dat")) // File.Create truncates any existing file; File.OpenWrite would leave stale trailing bytes
using(BinaryWriter wtr = new BinaryWriter(fs, Encoding.UTF8)) {

    wtr.Write( dict.Count ); // record the entry count so the dictionary can be pre-sized on load

    foreach(KeyValuePair<String,TFoo> pair in dict) { // enumerate pairs directly instead of looking up dict[key] again

        wtr.Write( pair.Key );
        wtr.Write('\0');
        pair.Value.Serialize( wtr );
        wtr.Write('\0'); // NULL characters work as record delimiters, for safety.
    }
}

Assuming that your TFoo's Serialize method is fast, I really don't think you'll get any faster than this approach.

Implementing a de-serializer is an exercise for the reader, but it should be trivial. Note how I stored the size of the dictionary in the file, so the returned dictionary can be created with the correct capacity, thus avoiding the re-balancing problem that @spender describes in his comment.
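
That said, a minimal sketch of the matching de-serializer (assuming the same TFoo contract as above) might look like this:

// requires: using System; using System.IO; using System.Text; using System.Collections.Generic;

Dictionary<String,TFoo> dict;

using(FileStream fs = File.OpenRead("filename.dat"))
using(BinaryReader rdr = new BinaryReader(fs, Encoding.UTF8)) {

    int count = rdr.ReadInt32();
    dict = new Dictionary<String,TFoo>( count ); // pre-size to avoid re-balancing as entries are added

    for(int i = 0; i < count; i++) {

        String key = rdr.ReadString(); // reads the length-prefixed string written by BinaryWriter
        rdr.ReadChar();                // consume the '\0' delimiter after the key

        TFoo value = new TFoo();       // relies on TFoo : new()
        value.Deserialize( rdr );
        rdr.ReadChar();                // consume the '\0' record delimiter

        dict.Add( key, value );
    }
}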

Dai
  • Although a non-answer, I really like your reasoning, so +1. Let me chew on this before marking it an answer. What are the other optimizations you're talking about? Curious! – DeepSpace101 Jan 04 '14 at 03:17

So we're going to stick with our existing strategy, given Dai's reasoning and the C# and Java compatibility we have to maintain (meaning the extra internal-state bits of the C# Dictionary would be dropped on the Java side anyway, which would load only the node data as it does right now).

For later readers still interested in this, I found a very good response here that somewhat answers the question posed. A critical difference is that the answer there is for B+ trees, not dictionaries, although in practical applications the two data structures perform very similarly. B+ tree performance is closer to dictionaries than to regular trees (binary, red-black, AVL, etc.): dictionaries deliver near-O(1) performance (but no "select from a range" ability), while B+ trees are O(log_b(X)), where the base b is usually large, making them very performant compared to regular trees where b = 2. I'm copy-pasting it here for completeness, but all credit goes to csharptest.net for the B+ tree code, tests, benchmarks and write-ups.

For completeness I'm going to add my own implementation here.

DeepSpace101