Memory efficient way to store lot duplicates of integer > 127

Question

I want to parse a file and keep it in-memory as Map<aID, Set<bID>>.

unique_a_IDs = 50.000;
unique_b_IDs = 1.000;
avg_set_length = 50;

As you can see, all set in summary will keep unique_a_IDs * avg_set_length = 2.500.000 of bIDs. Where each bID is from 0 to 1000. So in average each bID will be stored 2500 times. And I don't want JVM allocate memory 2500 times for each integer.

Is there any trick to keep that data structure memory-efficient?

The problem is that I can't (at least I don't know how yet) to use java's integer/string pools. Integer pool works only for numbers in range -128...127. String pool works only for compile time constants, but I read my bIDs from file.

Code example

import java.util.*;

public class MemoryTest {

    private final static Integer A_IDS_AMOUNT = 65536;
    private final static Integer B_IDS_AMOUNT = 1000;
    private final static Integer AVERAGE_SET_LENGTH = 50;
    private final static Random rand = new Random();

    public static void main(String [] args) {
        Map<Integer, Set<Integer>> map = new HashMap<>(A_IDS_AMOUNT);
        for (int i = 0; i < A_IDS_AMOUNT; i++) {
            Set<Integer> set = genRandomSet();
            map.put(i, set);
        }
        // Where SizeOf is premain class which use java instruments
        long size = new SizeOf().deepsize(map) / (1024 * 1024);
        System.out.println("Bytes used by object: " + size + " Mb"); //results in 175 Mb
    }

    private static Set<Integer> genRandomSet() {
        Set<Integer> set = new HashSet<>(AVERAGE_SET_LENGTH);
        for (int i = 0; i < AVERAGE_SET_LENGTH; i++) {
            set.add(rand.nextInt(B_IDS_AMOUNT));
        }
        return set;
    }
}

Use a `TreeSet` to canonicalize first? (You'd end up allocating lots, but then most instances could be immediately garbage collected.) — Jon Skeet, Oct 25 '16 at 10:40
@JonSkeet pls explain. I want to access `Set` by `aID` with O(1) time — VB_, Oct 25 '16 at 10:42
Or create a cache manually by using an `Integer[]` filled with numbers from 0 to 1000 and when you get a bId, you can store `array[bId]` instead of bId. — assylias, Oct 25 '16 at 10:42
It would really help if you'd provide a [mcve], instead of the sort of pseudo-code description you've got, especially using a period as a thousands separator in what looks like code. Without a clearer description of the issue, it's very hard to help you. — Jon Skeet, Oct 25 '16 at 10:43
@assylias how that will help JVM not allocate memory 2500 times for each integer? You propose to sve memory by not using so expensive data structures as Map/Set, right? — VB_, Oct 25 '16 at 10:44
I would not care about the integers. A reference to an object is at least 32 bits, this is the same space a primitive integer needs. — Henry, Oct 25 '16 at 10:55

score 2 · Accepted Answer · answered Oct 25 '16 at 10:56

2

There's java.lang.Integer.IntegerCache.high system property in Java 7 and higher that you can set (e.g. -Djava.lang.Integer.IntegerCache.high=<size>) to cache Integers up to a higher-than-default value - see source code for java.lang.Integer.IntegerCache.

However I doubt that will help you much since you'll still have much more memory consumed by the Map and Sets.

answered Oct 25 '16 at 10:56

Jiri Tousek

12,211
5
29
43

JVM still have to keep refence to an integer pool. Reference take 4 bytes as integer, what's the benefit? – VB_ Oct 25 '16 at 12:16
1

Reference might take just 2 bytes, even on 64bit systems - see http://stackoverflow.com/a/981130/5375403 – Jiri Tousek Oct 25 '16 at 13:38

score 1 · Answer 2 · answered Oct 25 '16 at 11:06

When you create a set to associate to an element of your map, you can check if previously you already built the same set. In case, you can associate this set to the element of the map. In this way, duplicated sets are stored just once. Building time could be expensive, but at the end, you have a more compact structure (e.g map.get(idx1) is the same set/object of map.get(idx2)). If instead sets are - all - different, I don't think that you have any chance to optimize it.

Memory efficient way to store lot duplicates of integer > 127

Code example

2 Answers2