
We have a big trie in our server-side service, with something on the order of tens of millions of nodes; the whole trie takes about 4 GB of RAM. Until now we've used only basic binary .NET serialization for storing the trie in a file and reconstructing it in memory, but it's way too slow. What better serialization approaches exist for our case? Some kind of direct mmap-like trick would be great, but .NET doesn't permit custom memory allocators. The aim is to minimize the time to save and, especially, to load the trie from the file (file size is not our concern).
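To illustrate the kind of mmap-like loading we mean: flatten the trie into an array of fixed-size structs and read nodes straight out of a MemoryMappedFile, so the OS pages data in on demand instead of deserializing everything up front. A minimal sketch (the FlatNode layout is hypothetical; our real nodes carry more per-node info):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

// Hypothetical fixed-size node record; children are stored contiguously,
// so a node only needs the index of its first child and a count.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FlatNode
{
    public int FirstChildIndex;  // -1 for a leaf
    public int ChildCount;
    public long Payload;         // stand-in for the per-node info
}

sealed class FlatTrie : IDisposable
{
    private static readonly int NodeSize = Marshal.SizeOf(typeof(FlatNode));
    private readonly MemoryMappedFile _mmf;
    private readonly MemoryMappedViewAccessor _view;

    public FlatTrie(string path)
    {
        // The OS pages nodes in lazily; nothing is parsed up front.
        _mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        _view = _mmf.CreateViewAccessor();
    }

    public FlatNode ReadNode(long index)
    {
        FlatNode node;
        _view.Read(index * NodeSize, out node);
        return node;
    }

    public void Dispose()
    {
        _view.Dispose();
        _mmf.Dispose();
    }
}
```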

Note: We absolutely can't use a relational database for this because of the latencies.

Update: We've found this similar question: Persisting a trie to a file - C. The C community seems better suited for this kind of question ;) => Accepting the protobuf-net solution.

eeq
  • Databases are too slow. And sorry, guys, trie: http://en.wikipedia.org/wiki/Trie – eeq Dec 04 '14 at 23:11
  • @DJKRAZE Please learn algorithms before making bad edits. – eeq Dec 04 '14 at 23:12
  • @john: Some data structures, such as elevation maps, would be inappropriate in a database structure. – Sam Axe Dec 04 '14 at 23:14
  • anyway `eeq`, good luck. It's Miller time over here; perhaps you need to grab a drink too, it may help free the dust from your mind... – MethodMan Dec 04 '14 at 23:14
  • @john Databases are pretty inappropriate in our scenario and many orders of magnitude slower because we do also some kind of regular expression search on the trie. – eeq Dec 04 '14 at 23:19
  • What kind of data is your trie holding? – hatchet - done with SOverflow Dec 04 '14 at 23:20
  • @hatchet It's a compressed string trie with some additional info/pointers per node. – eeq Dec 04 '14 at 23:21
  • What is the range and average of lengths of the strings? If you added some more detail of your trie into your question, that might help you get a useful answer. – hatchet - done with SOverflow Dec 04 '14 at 23:22
  • @eeq Is the additional info stored in your serialized data? Can this be stored elsewhere, or in a database, etc. with an id to reference it? Sorry, I'm still not quite sure of what you're storing and what it's relevant to. – ProgrammingLlama Dec 04 '14 at 23:22
  • @john Database is not the option because we need to maintain low latencies. Even one disk drive seek is absolutely unacceptable in our case. – eeq Dec 04 '14 at 23:26
  • @hatchet The height is 11 nodes, not perfectly balanced. – eeq Dec 04 '14 at 23:28
  • 1
  • @eeq I think we need to know more about what you are storing, especially in the additional nodes. If even one disk drive seek is unacceptable, this is new information but still doesn't tell us what you're trying to achieve or what the data is. – ProgrammingLlama Dec 04 '14 at 23:29
  • Why not investigate graph databases? http://en.m.wikipedia.org/wiki/Graph_database. Hadoop was meant to solve real-life problems like the one you've described. – code4life Dec 05 '14 at 01:43
  • @code4life Hadoop is a distributed file system with map-reduce as another layer. It's inappropriate for this scenario. – eeq Dec 05 '14 at 08:07
  • 1
  • I'm just gonna throw this out there: perhaps a memory-managed/garbage-collected environment isn't suitable for your needs either. Even with a concurrent GC, you'll likely hit the sort of bottlenecks you appear to be prematurely optimizing for... – Simon Whitehead Dec 05 '14 at 11:03
  • @SimonWhitehead Yes, we've thought about moving it to C++ but there's no budget for such a move. And no, the project has been in production for more than 3 years, so 'prematurely' doesn't suit this case :) Seems like protobufs or C++. – eeq Dec 05 '14 at 14:58
  • @eeq: that's a pity. Tries and map-reduce are not that far apart. That's how we solved some issues with huge object graphs with complex relationships, like market correlations for huge portfolios (à la Russell indexes). We just couldn't load everything into memory at once; we had to divide and conquer with threads. Map-reduce helped toward this end. – code4life Dec 07 '14 at 04:52

1 Answer


The generally accepted "fastest serializer" for .NET is protobuf-net; it is by a long way the fastest and produces the smallest output. See this file-stream serialization comparison of Protobuf, JSON, BSON and XML in .NET: http://damienbod.wordpress.com/2014/01/09/comparing-protobuf-json-bson-xml-with-net-for-file-streams/
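For reference, a minimal sketch of wiring a trie up with protobuf-net. The TrieNode shape here is hypothetical (the question's real nodes carry extra per-node info); Serializer.Serialize and Serializer.Deserialize are protobuf-net's standard entry points:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf; // protobuf-net NuGet package

// Hypothetical node shape; field numbers must stay stable across versions.
[ProtoContract]
public class TrieNode
{
    [ProtoMember(1)]
    public string EdgeLabel;          // compressed-trie edge fragment

    [ProtoMember(2)]
    public List<TrieNode> Children;   // serialized recursively as nested messages

    [ProtoMember(3)]
    public long Payload;              // stand-in for the per-node "additional info"
}

public static class TrieStorage
{
    public static void Save(TrieNode root, string path)
    {
        using (var stream = File.Create(path))
            Serializer.Serialize(stream, root);
    }

    public static TrieNode Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return Serializer.Deserialize<TrieNode>(stream);
    }
}
```

Note that with tens of millions of nodes the deserializer still has to allocate every node object; the win over BinaryFormatter comes from eliminating its reflection and metadata overhead during parsing.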

PhillipH