1

Basically I need to store in memory a set of string and be able to get the unique integer associated with one string and the string associated with one integer.

So it seem like I need something that implement Map<String,Integer> as well as Map<integer,String> but I don't need to decide the integer as long as they are unique.

One option I was thinking of doing is store a Trie where the leaf node contain the integer and a array of pointer to leaf-node sorted by integer value.

Is there a more compact way to store this in memory while still having good retrieval performance?

skyde
  • 2,816
  • 4
  • 34
  • 53
  • Can you explain better what you mean by "I don't need to decide the integer as long as they are unique" – Justin Nov 19 '13 at 01:08
  • The integer could be generated for me when I add a new string to the dictionary, the same way a database generate primary key. The client of the API don't need to provide it's own integer when adding a new String. – skyde Nov 19 '13 at 01:11
  • 1
    You should look into the HAT-trie. A HAT-trie is a type of burst trie (which is efficient for storing leaf nodes), designed to be cache-conscious. See this pdf: crpit.com/confpapers/CRPITV62Askitis.pdf Also, possible duplicate: http://stackoverflow.com/questions/3986158/mapping-from-string-to-integer-performance-of-various-approaches?rq=1 – bdean20 Nov 19 '13 at 03:23

2 Answers2

0

since you did not specify any other function other than lookup, you can use an unordered map (hashmap) to store both

If you control the unique integer: then make it consecutive starting at 0, then you can replace the int to string lookup with a simple array (vector)

both hash and vector have great locality - it will be hard to beat that with a trie.

Glenn Teitelbaum
  • 10,108
  • 3
  • 36
  • 80
  • This will work but will take more space then the Trie because It will have to store all the string – skyde Nov 19 '13 at 00:55
  • @skyde since you are able to have an array of strings (int to string) and a unsorted map(string to int) it will likely take less memory if you include all the pointers you will be adding and your array can be pointers to the same string as you hash key – Glenn Teitelbaum Nov 19 '13 at 03:25
0

This is just a passing idea but what about a trie-to-reverse-trie map?

Map:

a   = 5
and = 51
ant = 52
as  = 53
after = 54

Your string trie:

        a
      / | \
     n  s  f
    / \    |
   d   t   t
           |
           e
           |
           r

Your integer trie:

       5
    / / \ \
   1  2  3  4

Now, you keep a pointer from the "word" nodes in each trie to point to each other in both representations.

So... e.g. Node 'r' in string trie points to node '4' in integer trie and vice versa.

a <-> 5 (root)
d <-> 1
t <-> 2
s <-> 3
r <-> 4

So, if you wanted to get the integer from a string; you'd search the string trie until you get to the node which represents the word. Now, following the pointer to the node which represents last digit of the integer. You go from that node to the root which will give you each digit of the integer (in reverse).

To reverse the integer, you essentially just keep track of how many times you've had to follow your parent pointer.

Say you had a link like this in your integer trie: (e.g. integer=14235)

5->3->2->4->1->root

You'd keep a height and result variables:

current=5; height=1;          result=5 
current=3; height*=10==10;    result+=(current*height)==35
current=2; height*=10==100;   result+=(current*height)==235
current=4; height*=10==1000;  result+=(current*height)==4235
current=1; height*=10==10000; result+=(current*height)==14235

If you can keep the branching factor down to <=9 in the integer trie, you only have to store a byte in each node of the integer trie (outside of the pointers).

Obviously, you can do the reverse for integer-to-string....

Just a passing thought. Also, using a "compact trie" would be better but I used a traditional trie for example purposes.

Justin
  • 4,196
  • 4
  • 24
  • 48
  • It seem that using a Finite State Transducers would be more space efficient see https://issues.apache.org/jira/browse/LUCENE-2792 – skyde Nov 21 '13 at 21:50