4

Problem

I have a huge collection of strings that are duplicated among some objects. What is need is string interning. These objects are serialized and deserialized with protobuf-net. I know it should handle .NET string intering, but my tests have shown that taking all those strings myself and creating a Dictionary<string, int> (mapping between a value and its unique identifier), replacing original string values by ints, gives better results.

The problem, though, is in the mapping. It is only one-way searchable (I mean O(1)-searchable). But I would like to search by key or by value in O(1). Not just by key.

Approach

The set of strings is fixed. This sounds like an array. Search by value is O(1), blinding fast. Not even amortized as in the dictionary - just constant, by the index.

The problem with an array is searching by keys. This sounds like hashes. But hey, n hashes aren't said to be evenly distributed among exactly n cells of the n-element array. Using modulo, this will likely lead to collisions. That's bad.

I could create, let's say, an n * 1.1-length array, and try random hashing functions until I get no collisions but... that... just... feels... wrong.

Question

How can I solve the problem and achieve O(1) lookup time both by keys (strings) and values (integers)?

Two dictionaries is not an option ;)

ebvtrnog
  • 4,167
  • 4
  • 31
  • 59
  • 3
    Why are two dictionary's no option? You could hide it in a generic [`Map`-Class](http://stackoverflow.com/a/10966684/2441442) and can change the implementation any time, if you think, you have a better way. – Christian Gollhardt May 15 '16 at 03:17
  • There are tools/frameworks to compute perfect hashing functions for a limited set of input strings. If you know them upfront, it is easy to fit them into an array without collisions. – Thomas Jungblut May 15 '16 at 06:00

2 Answers2

0

Two dictionaries is the answer. I know you said it isn't an option, but without justification it's hard to see how two dictionaries doesn't answer your scenario perfectly, with easy to understand, fast, memory-efficient code.

From here, it seems like you're looking to perform two basic operations;

myStore.getString(int); // O(1)
myStore.getIndexOf(string); // O(1)

you're happy for one to be implemented as a dictionary, but not the other. What is it that's giving you pause?

Steve Cooper
  • 20,542
  • 15
  • 71
  • 88
0

Can you use an array to store the strings and a hash table to relate the strings back to their indices in the array?

Your n*1.1 length array idea might be improved on by some reading on perfect hashing and dynamic perfect hashing. Wikipedia has a nice article about the latter here. Unfortunately, all of these solutions seem to involve hash tables which contain hash tables. This may break your requirement that only one hash table be used, but perhaps the way in which the hash tables are used is different here.

Richard
  • 56,349
  • 34
  • 180
  • 251