
As the title says, is there a way for me to have a sparse int array in Java? Something like `int[]` or `Integer[]`, but where an element does not take up space until it is actually written with a non-zero value. This is because I want to optimise memory as much as possible (as most of the elements in the array would be 0/empty), yet have the option to look up by index, as with a standard array. I've considered the options below before asking this question:

  • I am aware of options such as Android's SparseIntArray (https://developer.android.com/reference/android/util/SparseIntArray). The problem is that they seem to be Android-specific, and I could not find a desktop equivalent (a minimal sketch of that approach follows this list).
  • My understanding is that `int[]` and `Integer[]` are not sparse arrays; even if I were to keep the elements of an `Integer[]` as null, each slot would still take up space for the reference, which wouldn't work for me.
  • Another option is to use hashmaps. While that works, it takes up a not-insignificant amount of space (which is an issue in this particular case).
  • Options like linked lists or an ArrayList (of Pairs, for instance) won't work, since lookups by index are difficult that way.
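
For reference, here is a minimal sketch of the kind of structure Android's SparseIntArray uses internally: two parallel sorted arrays (keys and values) with a binary-search lookup. The class and method names below are illustrative, not taken from any particular library.

```
import java.util.Arrays;

/** Illustrative sparse int array: sorted parallel key/value arrays, binary-search lookup. */
public class SparseIntArray {
    private int[] keys = new int[8];
    private int[] values = new int[8];
    private int size = 0;

    /** Returns the value stored at index, or 0 if nothing was stored there. */
    public int get(int index) {
        int i = Arrays.binarySearch(keys, 0, size, index);
        return i >= 0 ? values[i] : 0;
    }

    /** Stores a value at index (storing 0 could instead be treated as a delete). */
    public void put(int index, int value) {
        int i = Arrays.binarySearch(keys, 0, size, index);
        if (i >= 0) {
            values[i] = value;
            return;
        }
        int insertAt = -(i + 1);
        if (size == keys.length) {                 // grow both arrays together
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        System.arraycopy(keys, insertAt, keys, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        keys[insertAt] = index;
        values[insertAt] = value;
        size++;
    }
}
```

With this layout, `get` is O(log n) and `put` is O(n) in the worst case because of the element shift; the pay-off is roughly 8 bytes per stored entry plus some spare capacity.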
Leaderboard
  • The third option is to use linked lists. The fourth option is to use a Row-Column-Value array. See https://www.geeksforgeeks.org/what-is-meant-by-sparse-array/ – Robert Harvey Jan 01 '23 at 17:58
  • Note that the source code for the Android `SparseArray` class is [available online](https://android.googlesource.com/platform/frameworks/base/+/master/core/java/android/util/SparseArray.java), under an Apache license. – Robert Harvey Jan 01 '23 at 18:01
  • @RobertHarvey The problem with a linked list (which I considered) is that lookups by index won't work (without a manual iteration), which is difficult. A "Row-Value" array won't work either, as that simply reduces to a hashmap (with row as the key) from an implementation point of view (or alternatively something like the `Pair` class). It should be noted that my array is one-dimensional. If I have misunderstood you, please clarify. – Leaderboard Jan 01 '23 at 18:04
  • We could use a `Map<Integer, Integer>` to simulate a sparse array. This could be extended for matrices, e.g. by using a `Map<Pair<Integer, Integer>, Integer>` with `public record Pair<T, U>(T first, U second)`. – Turing85 Jan 01 '23 at 18:07
  • @Turing85 That is my current approach, but it appears to have a significant penalty in memory usage from my understanding (due to its relative complexity - see https://stackoverflow.com/questions/25560629/sparsearray-vs-hashmap for an example). In this case, I have to optimise memory usage as much as possible; if the array was relatively full, I would have gone with a primitive array. – Leaderboard Jan 01 '23 at 18:09
  • How did you confirm your suggestion that the `Map` is the root cause for the memory consumption? What limitations wrt. memory do you have? – Turing85 Jan 01 '23 at 18:13
  • @Turing85 My program is a bunch of hashmaps chained together (https://github.com/Leader-board/Wikimedia-statistics/blob/main/Global%20user%20table%20generator/Main.java). As it currently stands, this takes up almost 100 GB of RAM. I want to change the nested hashmap into one with a sparse array as its value. – Leaderboard Jan 01 '23 at 18:15
  • So you're okay with your arrays not being able to store `0`? Or how else do you want to distinguish between present and not-present? Note that using a reference type such as `Integer` (so that null can mean "not present") will require even more memory (4 bytes for the number and even more bytes for the reference). – knittl Jan 01 '23 at 18:19
  • And which parts **do** actually consume that memory? Have analyzed the memory consumption at runtime, e.g. through VisualVM? – Turing85 Jan 01 '23 at 18:20
  • "So you're okay with your arrays not being able to store 0?" - That is fine by me; in other words, it's fine to treat 0 as "not-present". – Leaderboard Jan 01 '23 at 18:20
  • > And which parts do actually consume that memory? It's the hashmaps - I haven't used VisualVM, but can consider running it if it helps. I do monitor the memory usage when running the program. – Leaderboard Jan 01 '23 at 18:21
  • Yes, it would help. In particular, it would help to know which `Map` consumes the most memory, and why. For further analysis, however, we would most probably also need a description on what the program actually does. – Turing85 Jan 01 '23 at 18:23
  • From what I can see, it seems that there might be a bunch of non-closed `Stream`s. I am not sure whether `System.setOut(...)` closes the current `System.out`-stream. – Turing85 Jan 01 '23 at 18:27
  • @Turing85 * It isn't `System.setOut`; that phase kicks in only after importing. The majority of RAM is used up in the importing stage. Additionally, I changed it to buffered writes, with the same result. * I am reasonably sure that it's the nested hashmap `static HashMap<String, HashMap<String, Integer>> users = new HashMap<>();` that is most intensive in memory. This can be shown by the fact that `analyser()` does not increase memory usage substantially despite it having a treemap of its own. I need to run a profiler though to fully answer your previous question. – Leaderboard Jan 01 '23 at 18:30
  • Is replacing the inner map with, for example, a `public record WikiEditCount(String wikiName, int edits) {}` an option? This would mean changing `Map<String, Map<String, Integer>> users` to a `Map<String, Set<WikiEditCount>> users` (a sketch of this idea appears after these comments). – Turing85 Jan 01 '23 at 18:38
  • @Turing85 Would that not potentially increase the memory usage instead, plus require `WikiEditCount` to be hashable anyway? I could be mistaken there though. – Leaderboard Jan 01 '23 at 18:40
  • It is defined as a record, thus it is implicitly hashable. My gut instinct tells me that the overhead of a `Set` of records is lower than the overhead of a `Map`. But to be sure, we'd need to run some tests... If you can provide me a test data set of relevant size, I can take a look. – Turing85 Jan 01 '23 at 18:42
  • "Options like linked lists or an ArrayList (of Pairs, for instance) won't work, since lookups by index are difficult that way." Creating software sometimes uses the art of compromise. You might have to sacrifice some desirable characteristic at the expense of improving some other characteristic. A sparse array is used to reduce memory requirements at the expense of some increase in the time needed to access an element. – Old Dog Programmer Jan 02 '23 at 15:30
  • If you are asking if there is a sparse array data structure that has O(1) get and set operations and also has low space overheads ... there isn't one. You need to compromise on some of the requirements. – Stephen C Jan 09 '23 at 02:48
  • @StephenC _Some_ compromise is indeed fine to me. For example, amortised O(1) (as the accepted answer does) or O(log n) are both OK. I was simply looking for the best option. – Leaderboard Jan 09 '23 at 13:37
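
To make the record-based suggestion from the comments concrete, here is a rough sketch (Java 16+ for records; the record and field names follow the comments, but the wrapper class and sample values are made up):

```
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RecordLayoutSketch {

    /** Records get equals/hashCode for free, so they can be stored in a HashSet. */
    public record WikiEditCount(String wikiName, int edits) {}

    // The inner HashMap<String, Integer> replaced by a Set of records.
    static Map<String, Set<WikiEditCount>> users = new HashMap<>();

    public static void main(String[] args) {
        users.computeIfAbsent("SomeUser", k -> new HashSet<>())
             .add(new WikiEditCount("enwiki", 42));
        System.out.println(users);
    }
}
```

One caveat: because `edits` is part of the record's equality, updating a count means removing and re-adding the record (and nothing stops `("enwiki", 42)` and `("enwiki", 43)` from coexisting in the set), so this trades per-entry footprint against update convenience.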

1 Answer


How large are your collections? Do note that `Integer` (a boxed reference type) adds significant memory overhead. Standard Java collection types (`Map`, `List`) only support reference types.

What does "significant" mean? An `int` takes 4 bytes of memory, a `long` takes 8 bytes. Compare that to an `Integer` (16 bytes) or a `Long` (24 bytes). Reference: "What is the storage cost for a boxed primitive in Java?"
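
For a rough sense of scale (a back-of-the-envelope estimate assuming a 64-bit JVM with compressed references): a row of 950 values stored as an `int[]` costs about 950 × 4 ≈ 3.8 KB, while the same row as an `Integer[]` costs roughly 950 × (4 bytes per reference + 16 bytes per object) ≈ 19 KB, before any savings from the small-value `Integer` cache.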

GNU Trove implements most collections with primitive types, which increases cache locality and decreases memory consumption. Perhaps this will work for you.
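
A short sketch of how Trove could stand in for a sparse int array (this assumes trove4j 3.x on the classpath; by default the map's "no entry" value for `int` is 0, which matches the question's convention that 0 means empty):

```
import gnu.trove.map.hash.TIntIntHashMap;

public class TroveSparseArrayDemo {
    public static void main(String[] args) {
        // Primitive int -> int map; only the non-zero "cells" are actually stored.
        TIntIntHashMap sparse = new TIntIntHashMap();

        sparse.put(7, 42);
        sparse.put(1_000_000, 3);

        System.out.println(sparse.get(7));        // 42
        System.out.println(sparse.get(123));      // 0 -- an absent index reads as zero
        System.out.println(sparse.size());        // 2 -- entries actually stored
    }
}
```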

knittl
  • * How large are your collections? The `users` hashmap can have up to 71 million keys, with up to 950 elements as the value for each key. It's that "up to 950 elements" part that I want to replace with a sparse array. I am aware that memory usage will still be high, but believe that it can be significantly lowered from the current ~100 GB, and hence my idea of a sparse array. * While I don't think GNU Trove answers my question, it could be useful and I'll need to look into it further. – Leaderboard Jan 01 '23 at 18:36
  • @Leaderboard GNU Trove would allow you to use a `HashMap` without the overhead of reference types. Why do you need to index though? How will you know which indices exist? – knittl Jan 01 '23 at 18:44
  • 71 million rows?! You mean `users.size()` returns 71 million?! – Turing85 Jan 01 '23 at 18:46
  • Another question: why do you need to keep everything in memory? – knittl Jan 01 '23 at 18:48
  • *This is because I want to optimise memory as much as possible (as most of the elements in the array would be 0/empty)*. Just curious: with 950 items per row, how many cells in the rows would actually have a non-zero value? Would there be sufficiently few to justify the overhead of an alternative data structure like a map? Even a homegrown map with minimal overhead? – WJS Jan 01 '23 at 19:11
  • You want a class that can emulate a 2D array with 67.45 billion elements? – Old Dog Programmer Jan 02 '23 at 02:13
  • @Leaderboard please keep in mind that using _boxed primitives_ adds significant memory overhead (12 bytes per `Integer`, 16 bytes per `Long`). I have extended my answer to make this clear. – knittl Jan 02 '23 at 16:17
  • I tried GNU Trove, and (combined with some other optimisations I did) reduced the memory usage to ~23 GB, which is much more reasonable. So in that way, your suggestion did work, and also answered my question (since this means that GNU Trove is close enough to a sparse array; a sketch of that kind of change follows these comments). @Turing85: Yes. WJS: The majority of the cells would be zero. OldDogProgrammer: No. I was simply looking for a sparse array data structure equivalent, which knittl answered for me. – Leaderboard Jan 08 '23 at 12:31
  • 100GB→23GB. Wow, that's a 77% reduction, or ¾! – knittl Jan 09 '23 at 13:58
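
To illustrate the kind of change discussed above (not the OP's actual code), here is a hypothetical sketch that swaps the inner `HashMap<String, Integer>` for Trove's primitive-valued map; it assumes trove4j 3.x, and the names `users` and `recordEdit` are illustrative:

```
import gnu.trove.map.TObjectIntMap;
import gnu.trove.map.hash.TObjectIntHashMap;

import java.util.HashMap;
import java.util.Map;

public class UsersTableSketch {
    // Outer map: user name -> per-wiki edit counts stored as primitive ints.
    static Map<String, TObjectIntMap<String>> users = new HashMap<>();

    static void recordEdit(String user, String wiki) {
        users.computeIfAbsent(user, u -> new TObjectIntHashMap<>())
             .adjustOrPutValue(wiki, 1, 1); // increment, or insert 1 if absent
    }

    public static void main(String[] args) {
        recordEdit("SomeUser", "enwiki");
        recordEdit("SomeUser", "enwiki");
        System.out.println(users.get("SomeUser").get("enwiki")); // 2
    }
}
```

This avoids one boxed `Integer` per edit count, which is where a good chunk of the per-entry overhead goes.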