1

I am creating a Trie in memory. Each node contains is a word. It is extremely good performance-wise. But the catch is the memory consumption.

It is 6GB big! I serialized it with protobuf and wrote it to a file that came out to be 150MB.

JSON is 250MB. I was hoping if there is a way to minify the strings? For eg:

enter image description here

As you can see there are duplicates in the first column. Also, it should be reversible.

All the properties/columns are string.

So let's say the table gets converted to :

enter image description here

I think that would save a lot of space. Of course I can do this by inserting each cell in a dictionary first and then assigning it an integer but I do not want to reinvent the wheel unless I have to.

SamuraiJack
  • 5,131
  • 15
  • 89
  • 195
  • 3
    How did you measure a single object in memory to be 6GB, but only 250MB when written to json? Can you give a [mre] of your model code? – gunr2171 Aug 24 '21 at 13:10
  • Sounds like you want to [compress the strings](https://stackoverflow.com/q/16606552/215552)... – Heretic Monkey Aug 24 '21 at 13:11
  • @gunr2171 I did not. I am sure there are other objects in there that makeup 6GB. But I know it's a massive trie. And if I have to optimize for space anywhere I would start there. The reason I did not share code sample is because then it would be like asking multiple question clubbed into one. Right now I am looking for a way to easily and efficiently compress/minify strings – SamuraiJack Aug 24 '21 at 15:37
  • @HereticMonkey I am trying that out as I am typing this. But I am don't think it will help much because each cell is not going to have a lot of repeated characters but there are going to be a lot of cells that have same value. For eg : "ABCD" in half of the cells of first column. – SamuraiJack Aug 24 '21 at 16:05
  • @HereticMonkey nope, it increased the size by 4 times. I am looking for something that will compress the string but the output will remain a string. – SamuraiJack Aug 24 '21 at 16:39

1 Answers1

-1

The idea you want to do is creating a dictionary first with all and than change actual values to dictionary key (that will be smaller).

This approach is used in Zip and other compress algorithms.

Ygalbel
  • 5,214
  • 1
  • 24
  • 32