0

I'm working on a school project where I'm building a geographical map from nodes. These nodes are loaded from an XML format from OpenStreetMap.org.

There are a lot of these nodes, and therefor I have to optimize memory whenever possible. One such optimization would be storing their id's as int instead of long.

The nodes have their ID values far beyond what Integers support, therefor I cannot directly create objects from them utilizing int or Integer. However, I am going to use around 40.000.000 nodes in total (which is less than the total amount of Integers possible).

My question is therefor if there is an intelligent way to apply a mathematical function to the Long values, that will turn them into ints. I need to be able to use this function once on all nodes when I save them in memory, and then later on when something refers to them by their original long value from the XML, I'll need to (again) find their new Integer value by the function. I've attempted the following:

int myInt = (int) myLong-Integer.MIN_VALUE... in order to bring the long value back into the int spectrum, however the longs are still too large to support integer format after this.

I've also attempted keeping a Map of "Long -> Integer", that I can lookup the long in, in order to get the int value. However this is nonsensical since I would then STILL be storing the Longs in memory (which is exactly what we are trying to prevent).

EDIT: As comments below asked for, the range of long values are indeed more than 4 billion apart. I don't have the exact Maxvalue and Minvalue available right now but the difference between them are about twice as large as the amount of available integers. (Therefor no simple subtraction would work)

EDIT2: Found the data: enter image description here

Markus B
  • 83
  • 1
  • 10
  • 1
    Generally, it's impossible to map 64 bit to 32 bit integers without loosing any information. But if you can share possible details on your long ids (range of values or something else) it may become possible. But for 40.000.000 records of you will use 32 bit instead of 63 bits it will save you about 150MB (and it's not [guranteed](https://stackoverflow.com/a/61472080/4655217)). By reading this question Knuth's cite comes to mind ["premature optimization is the root of all evil"](https://www.quora.com/What-do-engineers-mean-by-Premature-optimization-is-the-root-of-all-evil) – geobreze Apr 03 '22 at 21:12
  • Absent some useful detail about the structure of the 'long' values, any solution will involve storing an 'int to long' mapping, which as you observe, just makes it worse. The basic problem is that any int value will have 4 billion possible longs associated with it, so you need other information (the detail alluded to above) to pick only one. – passer-by Apr 03 '22 at 21:13
  • What range of values do your ids take? If it's a range of more than 4 billion numbers, you're probably out of luck. – Dawood ibn Kareem Apr 03 '22 at 21:17
  • The map can still help if you have more than one of each ID sitting in memory. – shmosel Apr 03 '22 at 21:28
  • the memory for a map may be far greater than the memory savings you got when using 32-bit ints, so it may not worth it at all in this case – phuclv Apr 04 '22 at 01:31

1 Answers1

0

When the longs are not tied to specific ranges, like 400_000L upwards for customers, 500_000L upwards for invoices, etcetera, then you need to store a mapping from long to int.

In memory you could use:

Map<Long, Integer> longToIntIdMap = new HashMap<>();
Map<Integer, Long> intToLongIdMap = new HashMap<>();

One could search an ideal hash function to have the least of collisions mapping to the same int id. Normally one could in the case of a collision use a second hash function and so on.

One could use Long.hashCode possibly ensuring non-negative numbers.

int getId(long id) {
    int intId = longToIntIdMap.computeIfAbsent(id,
        (k, v) -> {
            int hc = Long.hashCode(id); // First hashing.
            int intId = hc;
            while (intToLongIdMap.containsKey(intId)) {
                ++intId; // Second hashing.
                if (intId == hc) {
                    throw new IllegalStateException("All ints used");
                }
            }
            intToLongIdMap.put(intId, id);
            // Implicit longToIntIdMap.put(id, intId);
            return intId;            
        });
    return intId;
}

I personally would take the effort to count the collisions for a couple of hash functions. There are even some solutions to do this systematically. I have no links though.


The above is expensive, especially as Long and Integer are wrappers. IDs are unique in XML, so one could count them first and use a fixed size long array and as hashCode % array.length as initial index. Maybe wrapped in a class IntLongHashMap. Also the reverse map can be created afterwards.

Joop Eggen
  • 107,315
  • 7
  • 83
  • 138
  • But wouldn't that map use the same, if not more, memory than just keeping the ids as longs? Since the hashmap would atleast contain every long. – Markus B Apr 04 '22 at 05:15
  • Yes, especially as Long and Integer are wrappers. IDs are unique in XML, so one could count them first and use a fixed size array and as hashCode % array.length as initial index. Maybe wrapped in a class IntLongHashMap. Also the reverse map can be created afterwards. – Joop Eggen Apr 04 '22 at 06:39