1

I'm creating a trie, which my application will hold in it memory. Trie will have a lot of nodes and I'm thinking about how to reduce space usage. Of cause I will use trie to DAWG algorithm to reduse number of nodes, but as far as I know it's not enough.

Here is a node class

class Node{
  char letter;
  boolean EOW; // end of word
  Node child; // first child
  Node next; // next Node on this level
}

As far as I know object of this class will have 14 Bytes (2 Bytes given for char, 4 for boolean variable and 2*4 will be reserved for references)

I think that I can replace char by byte. That will save 1 Byte. However I don't know how much time that will take in type casting. And likely this is a bad desigin.

Also boolean takes 4 Bytes, perhaps you know what I can use instead of boolean?

So I need you to help me reduce size of nodes. Thanks in advance.

Andy Turner
  • 137,514
  • 11
  • 162
  • 243
Mike Herasimov
  • 1,319
  • 3
  • 14
  • 31
  • Can you do it in an object-oriented way, so you have `EndOfWordNode extends Node`, implicitly indicating the boolean value? – Andy Turner Apr 12 '15 at 11:13
  • @AndyTurner with the way tries are typically constructed, that will probably make things more difficult. – user253751 Apr 12 '15 at 11:14
  • @immibis "more difficult" sure. I'd rather not do it. But if space is the overriding concern, maybe eating the difficulty is the price. – Andy Turner Apr 12 '15 at 11:16
  • 1
    Why do you need to reduce the size of Nodes? – user253751 Apr 12 '15 at 11:16
  • because there will be a lot of them and them will take a couple of MBytes – Mike Herasimov Apr 12 '15 at 11:19
  • 2
    Say a node is about 24 bytes and you have 2 megabytes of them... then if you can save 6 bytes you'll save about half a megabyte. If saving half a megabyte matters, then Java is definitely not the language to use. – user253751 Apr 12 '15 at 11:20
  • However some "optimizations" will take more time for typical operations – Mike Herasimov Apr 12 '15 at 11:21
  • How little RAM does your system have? – Andy Turner Apr 12 '15 at 11:22
  • application will run on Android devices including handsets – Mike Herasimov Apr 12 '15 at 11:25
  • I'm think that it will be 32MBytes – Mike Herasimov Apr 12 '15 at 11:27
  • there is a way to store only the next node and just compute the prev node as needed but i would have to look it up there various size-optimisation schemes for tries and depending on your use-case you might need a different trie altogether – Nikos M. Apr 12 '15 at 11:40
  • You can use one bit of the `char` to represent the boolean, unless it may be bigger than `'\u7fff'`. – Bubletan Apr 12 '15 at 11:47
  • @Bubletan it interesting, can you say me how? And yes my letters will fits in 5 bits – Mike Herasimov Apr 12 '15 at 11:50
  • Also note that in Hotspot, the most commonly used Java VM, each object contains a one word class pointer and one word of other overhead. This adds a total of 2*4 = 8 bytes on a 32 bit system (and also on 64 bit systems if compressed pointers can be used). – Lii Apr 12 '15 at 12:07

2 Answers2

2

If letter takes only 5 bits and eow one bit, you can pack them in a single byte to save memory.

char letter = ...;
boolean eow = ...;

byte packed = (byte) ((eow ? 0b10_0000 : 0) | letter);

letter = (char) (packed & 0b1_1111);
eow = (packed & 0b10_0000) != 0;
Bubletan
  • 3,833
  • 6
  • 25
  • 33
1

If you don't need the weirder half of UTF-16 characters, you can use the highest bit of letter as the EOW marker.

For example, here the eoWletterA variable has the letter 'a' encoded with the EOW bit:

char eoWletterA = 'a' + 0x8000;
char letter = (char) (eoWletterA & 0x7FFF);
boolean eow = BigInteger.valueOf(eoWletterA).testBit(15);

Your trie should be encapsulated properly. Make sure that the EOW bit can't be accidentally set when storing a character to the trie.

UPDATE: Note that removing the boolean variable from the Node may or may not make a difference in the memory footprint of a Node object in a JVM. You can examine the object memory footprint with the following tool: https://stackoverflow.com/a/52682/1207523

Community
  • 1
  • 1
Lindlof
  • 2,152
  • 2
  • 17
  • 26
  • Thank you :) Surely it can be optimized, with readability trade-offs, if you have high performance requirements. – Lindlof Apr 12 '15 at 12:41
  • Both objects and individual fields are padded on most architectures. `char`+`boolean` and `byte` is likely to occupy single word. Make sure to test this really gives you measurable performance gain. On my machine there is no difference in used memory. – Piotr Praszmo Apr 12 '15 at 12:48
  • @Banthar Trying this using the answer: http://stackoverflow.com/a/52682/1207523. The Node object is 24 bytes in my JVM. I notice that even when removing the boolean and the char variables, the memory usage of the Node object remains 24 bytes in my JVM. – Lindlof Apr 12 '15 at 13:28