12

I'm doing some JavaScript 3D processing, and I have a very large number of objects (call them objects A), each one containing some data and an array of positive integers, such as [0, 1, 4], [1, 5, 74, 1013], etc. They don't need to have a private copy; all objects can share the same list. Those numbers can range from 0 up to around 65k (they fit in a short).

Profiling revealed that those arrays are eating a lot of memory. While computing, my program reaches more than 2GB of allocated memory, so this is not a case of premature optimisation.

I have two leads for memory optimisation:

  1. Find a more memory-efficient way to store those lists (maybe as bits packed into big numbers?)
  2. Find some way to avoid duplicates. For instance, I found that some arrays (like [0,1,2,3,4,5,6]) were present in more than 40,000 objects A. Maybe storing those arrays in a tree structure and making my objects A point to it would help?

Do you have a suggestion?

EDIT: I forgot to mention it, but it's important: each integer in a list is UNIQUE.

EDIT 2: The only thing I need to retrieve is the SET of integers; the order isn't important.

I'm thinking of converting those arrays to "big integers" with bitwise operations, i.e. create a big integer with some class, set bits 1, 5, 74 and 1013, then convert the big integer to a string (essentially an array of bytes) and store the string. But it won't always be a gain: for instance, the array [4000] would be represented as a string roughly 500 bytes long...
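Here is a rough sketch of what I mean (illustrative only; the 16-bit word size and the string packing are just one possible choice):

// Rough sketch: encode the set as a bit mask packed into a string,
// one UTF-16 code unit (16 bits) per word.
function setToBitString(values) {
  var maxBit = Math.max.apply(null, values);
  var words = new Uint16Array(Math.floor(maxBit / 16) + 1);
  values.forEach(function (v) {
    words[v >> 4] |= 1 << (v & 15);   // set bit v
  });
  return String.fromCharCode.apply(null, words);
}

function bitStringToSet(str) {
  var values = [];
  for (var i = 0; i < str.length; i++) {
    var word = str.charCodeAt(i);
    for (var b = 0; b < 16; b++) {
      if (word & (1 << b)) values.push(i * 16 + b);
    }
  }
  return values;
}

// The downside mentioned above: setToBitString([4000]) still produces a
// 251-character string (about 500 bytes) with only a single bit set.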

Scope of the project (not needed to answer the question, but I've been asked for it)

This shouldn't be needed to answer the question, but I've been asked for it several times, so I'm putting it here.

I'm building a 3D mesh of volumetric objects; to simplify, let's just assume I have a lot of spheres. I know their position (center, radius) and I want to draw them as a single 3D mesh. For that, I have a memory structure called an octree that allows me to divide the 3D space into smaller cells (the octree nodes) around the surface of my object. I can then build a mesh from those cells.

Those cells are what I called objects A in the description. Each cell contains a list of ids (positive integers) which point to the Sphere objects the cell intersects.

The fact is that profiling showed those cell arrays are retaining several hundred MB in memory. I would like to reduce that number by finding a way to remove all duplicates and/or, if possible, a more efficient way to store a list of positive ids (which can go from 0 to 65k).

Sebastien
  • 682
  • 1
  • 14
  • 26
  • Maybe you could store the arrays as-is, with dups, then memoize all the functions that may use the arrays to compute something. Big arrays in JS are typically not a problem; it's the methods you use to interact with them where you will find performance differences. – elclanrs Jan 10 '14 at 17:42
  • 2
    My suggestion would be to post your code so that we can actually see what is wrong. No idea what your problem is based on the above... – MaineCoder Jan 10 '14 at 17:49
    I can't actually post thousands of lines of code... I don't even know where to start to reduce this problem to a few lines for SO... – Sebastien Jan 11 '14 at 17:07
  • 1
    Are you sure the 2GB you mentioned is from your integer lists alone? Because that would mean you have over a billion numbers (you said they were shorts, if they are regular ints you would still need half a billion numbers). That is a LOT of numbers. Maybe you are storing textures somewhere, that would take up a lot of memory.... – Kevin Jan 15 '14 at 14:38
  • Anyway, take a look at [this SO question](http://stackoverflow.com/questions/2954349/when-should-i-use-indexed-arrays-of-opengl-vertices). It's about vertex arrays in OpenGL, but the basic principle is the same as what you are trying to do. The optimization mentioned there can also be applied in your situation. – Kevin Jan 15 '14 at 14:46
  • No, the 2GB isn't only from those arrays. I have heavy processing; those arrays take between 400MB and 600MB. Then I have the cells themselves (objects A)... – Sebastien Jan 15 '14 at 18:05
  • Just another idea: you can sort & convert all the numbers in an array to a string, then apply memoization over duplicate strings with your own indexing, and finally replace the string with your index. When you need those numbers, you can read them back from the memoization cache object. – risyasin Jan 16 '14 at 12:13
  • That's one of the ideas I had in mind, indeed :) – Sebastien Jan 16 '14 at 12:26
  • Just out of curiosity: what drove you to do such 3D crunching in JavaScript in the first place? – Tharabas Jan 19 '14 at 10:54
  • What are you doing with the arrays? Are they constantly being modified, or mostly just read from? – Liron Jan 20 '14 at 08:51
  • @Tharabas: check [my website](http://www.skimlab.com) – Sebastien Jan 20 '14 at 09:04
  • Liron: just reading the arrays – Sebastien Jan 20 '14 at 09:04

14 Answers

3

That sure seems like a lot of memory for arrays of that size. Without seeing your source code, I would take a close look at where you perform operations on the arrays:

  • Some array operations will create a copy of the array.
  • Initializing an array small and expanding it many times as the collection of items grows will generate a large memory overhead (see the sketch below).
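For illustration, a sketch of the difference (n and computeId() are hypothetical placeholders, not from the question's code):

var n = 100000;                                  // hypothetical number of ids
function computeId(i) { return i % 65536; }      // stand-in for the real computation

var ids = [];
for (var i = 0; i < n; i++) {
  ids.push(computeId(i));          // growing one element at a time may reallocate repeatedly
}
var copy = ids.slice();            // slice/concat/map each create a full extra copy

// Preallocating a typed array of known length avoids both problems:
var packed = new Uint16Array(n);   // fixed size, 2 bytes per element
for (var j = 0; j < n; j++) {
  packed[j] = computeId(j);
}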
Niklas Arbin
  • 652
  • 6
  • 14
  • Well, depending on the data to process, I have between 100k and 800k objects A containing those arrays, so despite the 2GB in RAM, the memory consumption of the program seems normal. That doesn't mean I can't optimise it, but it doesn't look like there's an error somewhere. – Sebastien Jan 13 '14 at 14:57
3

Check out this page: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays?redirectlocale=en-US&redirectslug=JavaScript%2FTyped_arrays . It describes the low-level typed array containers in JavaScript; I think one of them will fit your needs.
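For example, since the ids fit in 16 bits, a Uint16Array stores exactly 2 bytes per element instead of a full JS number per slot (a small sketch):

var ids = [0, 1, 4, 1013];
var packed = new Uint16Array(ids);   // copies the values into a compact 2-bytes-per-id buffer

packed[3];       // 1013 (indexed access works like a plain array)
packed.length;   // 4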

Vlad Nikitin
  • 1,929
  • 15
  • 17
1

If there are a significant number of duplicates, you could try using a hash set (one where the key and value are the same) to store the integer lists. That way, if a key already exists in the set, you don't need to add more to it. Your original list of objects would then just contain references to the hash set members instead of the members themselves.

This will add a tiny bit of performance overhead, but if memory is the bottleneck then it might not be an issue.
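A rough sketch of the idea, using the stringified sorted list as the key (the pool and the cell field below are placeholders of mine, not anything from the question):

var pool = {};   // key: canonical string form of the list, value: the single shared array

function intern(list) {
  var key = list.slice().sort(function (a, b) { return a - b; }).join(",");
  if (pool[key]) return pool[key];   // an identical list already exists: reuse it
  pool[key] = list.slice();
  return pool[key];
}

// Every cell that needs [0,1,2,3,4,5,6] now references the same array:
var cell = {};   // one of the octree cells (placeholder)
cell.sphereIds = intern([0, 1, 2, 3, 4, 5, 6]);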

KevinH
  • 413
  • 2
  • 7
  • This looks nice, but how can I find an effective hash function for a list of integers without collisions? Should I convert the list to a string with bitwise operations? – Sebastien Jan 13 '14 at 12:30
  • I think javascript will automatically call the toString method. The string is unique per set so it should work as the key. Strings internally have their own optimized hash functions so that should be fairly reasonable at minimizing collisions. – KevinH Jan 14 '14 at 00:38
0

Without seeing actual requirements, my first impression is that you should look at ways to optimize the algorithms you use to reduce the memory footprint of the data structures involved, before tackling implementation details.

Then, I would suggest evaluating the following options:

  • try to use plain objects as hashes instead of the arrays
  • rewrite your code in Dart or Java and compile it to JS using the GWT compiler and benchmark the results
  • try to use WebGL shaders
  • implement parts of your code using Native Client SDK or NPAPI (but be warned that NPAPI is deprecated)
npclaudiu
  • 2,401
  • 1
  • 18
  • 19
  • This is quite a general solution to the problem of "reducing your JavaScript code footprint". Unfortunately, I can't really afford those (except maybe the first one, which I didn't really understand). I identified the causes of the big memory footprint and I would like to find a different data structure that solves it as effectively as possible. – Sebastien Jan 13 '14 at 14:53
0

Here's my go at this:

  1. If you're in control of your data set, you can try using a single JS number (8 bytes long) to store more values (maybe 4 values, each 2 bytes long); the exact memory gain depends on the actual range. Also, bitwise operations tend to be quite fast, so there should be no performance hit (see the sketch after this list).

  2. You could also try wrapping your values in an object (e.g. OpNumeric) that is immutable. Each time you need a numeric value, you ask a NumericManager for a wrapped instance and store the reference to the OpNumeric. This way all values of 5 will reference the same OpNumeric(5) object; I'm not sure what the size of a reference is in JS (it probably depends on the implementation & machine capabilities), but it's worth trying. OpNumeric is also a good candidate for implementing (1). This will decrease performance a bit and will probably temporarily increase memory usage, but that should only be while resolving OpNumeric references.

  3. If you generate your arrays on the fly (and not by appending values one by one), maybe it's worth hashing and reusing them. Because you already know the range, you can come up with a hash that gives you as few collisions as possible; and even if you get a couple of collisions, it should be fairly CPU-friendly to pick the right reference by value-by-value comparison.
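Regarding point 1, here is a sketch of packing two 16-bit ids into one number (JavaScript's bitwise operators work on 32 bits, so two ids per number is the safe limit; the function names are mine):

function pack(a, b) {
  return (a << 16) | b;            // a and b must each fit in 16 bits
}
function unpackHigh(p) {
  return p >>> 16;                 // unsigned shift so ids >= 32768 come back positive
}
function unpackLow(p) {
  return p & 0xFFFF;
}

var p = pack(74, 1013);
unpackHigh(p);   // 74
unpackLow(p);    // 1013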

Sorry this is not a straight answer, but with these kinds of things there's no direct way of achieving the goal. Maybe it's worth looking at asm.js. Seems to have a bit in common with what you're doing.

Cheers,

marius-O
  • 395
  • 3
  • 15
  • 1/ This was clearly one of the ideas I have in mind. 2/ I don't think it's worth replacing integer values with references; references are usually the same size or larger in memory. 3/ Right again, it's probably worth hashing and reusing those arrays. But if there is a near-perfect bijective hash function, it would be better not to hash anything and instead use it to compress the data. – Sebastien Jan 13 '14 at 15:04
  • The biggest blocker with the bitwise approach is that if you have small arrays with large numbers, say [4000], then you end up with one big integer whose 4000th bit is set... so 500 bytes of storage just for this array... – Sebastien Jan 13 '14 at 15:18
  • Can you describe what your project is doing a bit more? What are the options & goals. Is it your only goal to reduce memory footprint without changing too much of the logic? E.g. Can you look at partial processing of your model (so that you don't need to load all data at once)? – marius-O Jan 13 '14 at 16:16
  • I'm quite reluctant to give more information about the project, because I think we'll lose focus on what I'm allowed/not allowed to do. I'll give a tip here: I have a memory structure called an [octree](http://en.wikipedia.org/wiki/Octree), whose nodes are the objects A containing those lists. I have a large number of nodes and I can't do anything about that. Those cells contain a list of integers (object ids) which basically represent some spheres to be drawn. Each list represents the spheres intersecting that octree node. – Sebastien Jan 13 '14 at 17:37
  • Since a lot of nodes intersect the same spheres, they hold the same (duplicated) list in memory. I'm now looking for a way to reduce the memory footprint of my program without affecting the access time to those lists too much. Different nodes can share the same list, since they won't modify it. – Sebastien Jan 13 '14 at 17:40
0

I am daring to tell a mythological thought.

In this mythological thought,

  1. You will have a static array of 0-65k ints - sorted.
  2. An array containing metadata for each of your A objects, and...
    1. slice(x, y) will get the chunk of the static array from step 1 that starts at x and ends at y.
    2. dupeof(oldarray) will simply return oldarray at runtime.
    3. diff that calculates the diff of two arrays.
    4. sort that will sort an array of ints ... [].sort(function(a,b){ return a-b; });
  3. An array dupes for storing the unique duplicated arrays.

On top of this mythology, this is what I wish we could do...

  1. When you create a new object (your A), you sort its array, slice the static array from the first value of this A to the last value of this A, compute the diff between that slice and your current A's array, and store whichever is smaller.
  2. In this scenario, an example entry (the metadata for an object A) would look like "0-11,[3,5]". In plain English: construct an array from the static array by slicing from 0 to 11 (inclusive), then remove 3 and 5 from that slice. We would get [0,1,2,4,6,7,8,9,10,11].

  3. You have to make your A's constructor go through this logic while it is being constructed. So, at any point in time, you only keep the metadata of your objects (your A's), and your code reads this metadata to build your mesh on the fly at run time.

Please also note that there are still things to address, like the ambiguity between slicing and simply storing the small array... but I believe this kind of metadata (storing data about the data instead of the data itself) would definitely be worth the effort in situations like this. A sketch of decoding such an entry follows.
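A minimal sketch of expanding that metadata format (the "start-end,[exclusions]" string and the parsing below are only one possible reading of it):

// STATIC is the sorted master array from step 1 (here simply 0..65535).
var STATIC = [];
for (var i = 0; i <= 65535; i++) STATIC.push(i);

// "0-11,[3,5]" -> take STATIC[0..11] inclusive, then drop 3 and 5.
function expand(meta) {
  var parts = meta.split(",[");
  var range = parts[0].split("-");
  var excluded = parts[1].replace("]", "").split(",").map(Number);
  var slice = STATIC.slice(Number(range[0]), Number(range[1]) + 1);
  return slice.filter(function (v) { return excluded.indexOf(v) === -1; });
}

expand("0-11,[3,5]");   // [0,1,2,4,6,7,8,9,10,11]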

Siva Tumma
  • 1,695
  • 12
  • 23
0

I've worked on systems that displayed, in a web page, paginated tables containing up to twenty-five thousand rows of customer data including 10 columns of name, age, address, phone, etc., and as you can imagine, we were looking for ways to optimize the display of that data.

Obviously, building one big searchable, sortable table in a web page and trying to maintain all of that data was a memory hog (you can imagine what it did to IE7/8).

The four solutions I came up with:

  1. Take the JSON data I got from the server, split it into manageable indexed JSON strings and put it in local storage and then access the data as required.

  2. If local storage wasn't available, I'd load a Flash object into the page and it retrieved the data from the server and I used the ExternalInterface to access the data as needed.

  3. If Flash wasn't available, I'd store it in a Java Applet and access it in a similar manner.

  4. Lacking any of those options, the user had to suffer a little. They'd need to input some search parameters and they'd get to see only the data returned via an Ajax call, and only in manageable chunks. But if they didn't have local storage, Flash or Java available, then they got what they deserved. ;)

KellyCode
  • 728
  • 6
  • 12
0

Depending on how sparse your array is, you might want to store ranges of integers where you would normally have continuous sequences of integers in your array (i.e., [[1, 5], [10, 14]] instead of [1, 2, 3, 4, 5, 10, 11, 12, 13, 14] or even [1, 5, 10, 14] since your code can assume it is formatted in pairs). If your array is very sparse you could still use this method by storing the ranges of the gaps in the sequence. To give you an idea:

function IntegerArray(integers) {
  // Convert the integers to sorted, inclusive [start, end] ranges by sorting
  // them and starting a new range at every gap.
  this.ranges = [];
  var sorted = integers.slice().sort(function (a, b) { return a - b; });
  for (var i = 0; i < sorted.length; i++) {
    var last = this.ranges[this.ranges.length - 1];
    if (last && sorted[i] <= last[1] + 1) {
      last[1] = Math.max(last[1], sorted[i]);     // extend the current range
    } else {
      this.ranges.push([sorted[i], sorted[i]]);   // gap found: start a new range
    }
  }
}
IntegerArray.prototype = {
  insert: function (n) {
    // Could achieve O(log n) time complexity with binary search since
    // the ranges are sorted.
    // * n is between two ranges = insert [n, n] range.
    // * n is 1 less than the start of a range = decrease range start by 1.
    // * n is 1 more than the end of a range = increase range end by 1.
  },
  remove: function (n) {
    // Also O(log n) time complexity with binary search.
    // * n has [n, n] range = remove range.
    // * n is at beginning of range = shift range.
    // * n is at end of range = pop range.
    // * n is in middle of range = split range.
  }
};

On the other hand, this may not be what you want if your array has many isolated integers, because each of those will require two integers instead of one.


After considering this more, it occurred to me that you could also store your array as the following:

  • A starting point: The smallest integer in your array.
  • An array of integers representing the sizes of runs of consecutive integers (as positive integers) and gaps (as negative integers), instead of the values themselves.

This shrinks the data structure further, but would probably decrease insertion/deletion performance. A sketch of the encoding follows.
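A rough sketch of that second encoding, as I read the two bullets above (the names and exact layout are my own guesses):

// [1,2,3,4,5,10,11,12,13,14] -> { start: 1, runs: [5, -4, 5] }
// (runs of consecutive values as positive counts, gaps as negative counts)
function encodeRuns(sorted) {
  var runs = [];
  var runLen = 1;
  for (var i = 1; i < sorted.length; i++) {
    var gap = sorted[i] - sorted[i - 1] - 1;
    if (gap === 0) {
      runLen++;
    } else {
      runs.push(runLen, -gap);
      runLen = 1;
    }
  }
  runs.push(runLen);
  return { start: sorted[0], runs: runs };
}

function decodeRuns(enc) {
  var out = [];
  var next = enc.start;
  enc.runs.forEach(function (len) {
    if (len > 0) {
      for (var i = 0; i < len; i++) out.push(next++);
    } else {
      next += -len;            // skip over the gap
    }
  });
  return out;
}

decodeRuns(encodeRuns([1, 2, 3, 4, 5, 10, 11, 12, 13, 14]));   // round-trips the original set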

Ethan Lynn
  • 1,009
  • 6
  • 13
0

The idea of using a special structure (string or big int with bit mask) to recreate the array each time is, I think, incorrect. This will make you re-create the arrays over and over, and the overhead you'll introduce in doing so (extra CPU time + GC time) is probably not worth it. Much like in a database, a calculated field is fair game for tiny amounts of data, but blows up in your face when you have large amounts -- better store the calculated and directly usable result if performance counts.

The idea of using a hash table has, I think, a bit more merit. But there is a tradeoff here, since it depends on how your spheres are lying around. You say in your question:

Each cell contains a list of ids (positive integers) which point to the Sphere objects the cell intersects.

This, I read as: depending on the sphere distribution, you may end up in pathological cases where computing and retaining an extra integer not only uses extra CPU but also -- and most importantly -- ends up retaining even more objects than you already have. For that reason, I think I'd rule it out too: benefits in most cases until it blows up in your face.

Imo, consider listing the objects directly, instead of using and retaining their IDs. Each object is going to exist anyway, and I presume that each needs to keep the array of intersecting spheres around for performance reasons anyway. So might as well not add the overhead (memory + access time) of having extra IDs around.

More generally speaking (without fully understanding what you're up to, it's hard to be more precise): consider revisiting your data model so it is more memory efficient to begin with; don't retain objects you don't need, and try using an object graph instead of a hash of objects referenced by IDs.

Lastly, consider whether you actually need all of these objects to reside in memory at all times. Because if not, you could just as well put the stuff in a persistent store, and only keep the stuff you actually need for your current use.

Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
0

an array of positive integers, such as [0, 1, 4], [1, 5, 74, 1013], etc. They don't need to have a private copy; all objects can share the same list. Those numbers can range from 0 up to around 65k (they fit in a short).

One way

Replacing the array of ints with an array of bit flags packed into an int array would make sense if the arrays held large objects. But since they hold short integers, the advantage is likely minimal.

A possibility could be to replace the numbers with arrays of 32-bit flag words; for example

[ 10, 13, 1029 ]

could be translated into blocks of size 32, wherein 10 and 13 would fall in the same block and could be encoded as (1<<10) | (1<<13). But to do so, you need each list to contain all the blocks, which means 2048 32-bit integers, most of which would be zero.

A variant (a sort of run-length encoding), in case most of the numbers are very near one another (within 32 of each other), could be to store each "run" of numbers as a pair: one value describing the first number N, the other encoding which of the following numbers, up to N+32, are present. So:

 15, 21, 27, 29, 32, 40, 44

becomes:

 15, (21-16=)5, (27-16=)11, 13, 16, 24, 28

and then:

 15, (1<<5) | (1<<11) | (1<<13) | (1<<16) | (1<<24) | (1<<28)

so that you can store 6 extra shortints in one 32-bit value. This only works if the arrays are "clumped". If you have most numbers dispersed, farther away than 32, it definitely won't be worth the effort.

Also, you have to factor in the cost of dealing with such numbers, i.e., what the cost is of reinflating a compressed array in order to do something with it.
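A sketch of that (base, mask) pair for one "clump" (only valid when the remaining values are within 32 of the base; the function names are mine):

// [15, 21, 27, 29, 32, 40, 44] -> [15, mask] with one bit per following value
function packClump(sorted) {
  var base = sorted[0];
  var mask = 0;
  for (var i = 1; i < sorted.length; i++) {
    var offset = sorted[i] - base - 1;     // 21 -> bit 5, 27 -> bit 11, ...
    if (offset >= 32) throw new Error("value too far from base for one 32-bit mask");
    mask |= 1 << offset;
  }
  return [base, mask >>> 0];
}

function unpackClump(pair) {
  var out = [pair[0]];
  for (var b = 0; b < 32; b++) {
    if (pair[1] & (1 << b)) out.push(pair[0] + 1 + b);
  }
  return out;
}

packClump([15, 21, 27, 29, 32, 40, 44]);
// -> [15, (1<<5) | (1<<11) | (1<<13) | (1<<16) | (1<<24) | (1<<28)]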

Another way

You could try to keep a structure holding all arrays used so far, with a method to retrieve an array given its ID. A tree structure would allow you to share common-prefix arrays, but then you'd have trouble recovering them by ID. So I'm afraid the best you can do is use a very long array of arrays:

[ [ 0, 1, 4 ],             // ID#0
  [ 1, 5, 74, 1013 ],      // ID#1
]

which means that finding whether an array already exists is going to cost you. You would also need to add an index:

{ "1": { "5": { "74": { "1013": { "-1": 1 // The array ending here has id 1

Now suppose you store [ 1, 5, 74, 1013, 1089 ]: when you arrive at the key 1013 you find no 1089 key, so you know this isn't a duplicate. You store it in the main array, recover its index (say 1234), and add "1089": { "-1": 1234 } under the 1013 key.

Adding an array is reasonably fast, as is accessing its values.

Memory-wise, is it worth it? Instead of N integers, each unique array is now made of N integers plus (N+1) dictionaries holding at least one integer each, so I'd say the cost is somewhere between trebled and quadrupled. If there are very many duplicate arrays, this may be advantageous; otherwise it might not be. If fewer than, say, one third of the arrays are duplicates, it likely won't be.

Additionally, you now can't easily modify the arrays, since you'd need to reindex the tree; and removing an array means reindexing twice: once to remove the array, and once to locate the last array, update its index to the slot just vacated, move it to fill the gap, and shrink the array list by one.

 [ 1 ]       [ 1 ]      [ 1 ]
 [ 2 ]       [ 2 ]      [ 2 ]
 [ 3 ]  -->        -->  [ 5 ]
 [ 4 ]       [ 4 ]      [ 4 ]
 [ 5 ]       [ 5 ]      

Twist

Instead of the array and tree above, you could hash the list and use the hash as an id. This would risk collisions, though, unless you used the stringified list (stringified by you, or by JS; the first method is thriftier, the second faster) as the key. In that case, you would need on average, say, four bytes per integer in the key; smaller numbers would weigh less though, as "1.12.13.15.29" is only some 14 bytes or thereabouts. The memory footprint would be roughly trebled for the unique arrays and zeroed for the duplicates; again, it all comes down to how many duplicates there are compared to non-duplicates.

LSerni
  • 55,617
  • 10
  • 65
  • 107
0

Something similar is done in Lucene/Solr. Feel free to check https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/DocSetCollector.java . It's Java, but I bet you can use the same idea in JavaScript.

In short, initially they store an int array

// in case there aren't that many hits, we may not want a very sparse
// bit array.  Optimistically collect the first few docs in an array
// in case there are only a few.
final int[] scratch;

But later, if the number of matched documents gets too large, they switch to a BitSet, which is just

bits = new OpenBitSet(maxDoc);

Here maxDoc is the maximum number of items in the list. I don't know whether you can find that number for your task, but maybe you know there are never more than N integers in a list (it looks like you mentioned 65k).

So, if you have the numbers 1,2,3,4,5,6,7,8,9,10 stored as integers, you use 10 * 32 bits, which is 320 bits. But if you allocate a bitset of size 10, it's just 10 bits: if the 1st bit is set, you have 1 in your list; if the 10th bit is set, you have 10 in your list. So your threshold for switching is 2048 elements: 2048 32-bit integers take 65536 bits, and a 65536-bit bitset can encode membership for all 65536 possible values instead of just 2048 of them.

Note: items are unique in such a bitset, and the order is, obviously, ascending.
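In JavaScript the same switch-over could look roughly like this (a sketch only; the 2048 threshold follows the answer's arithmetic for 32-bit integers, and the helper names are mine):

var MAX_ID = 65536;            // ids fit in a short
var THRESHOLD = 2048;          // beyond this, a 65536-bit set is the smaller option

function buildIdSet(ids) {
  if (ids.length <= THRESHOLD) {
    return new Uint16Array(ids);              // few ids: keep a compact list
  }
  var bits = new Uint32Array(MAX_ID / 32);    // many ids: one bit per possible id
  ids.forEach(function (id) {
    bits[id >>> 5] |= 1 << (id & 31);
  });
  return bits;
}

function hasId(set, id) {
  if (set instanceof Uint32Array) {
    return (set[id >>> 5] & (1 << (id & 31))) !== 0;
  }
  return Array.prototype.indexOf.call(set, id) !== -1;
}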

Vadim Kirilchuk
  • 3,532
  • 4
  • 32
  • 49
0

Store each array as a string. A string is essentially an array of shorts (UTF16), and with interning the runtime will avoid extra storage for identical arrays/strings. Use String.fromCharCode() to convert a UTF16 numeric value to a single-character string. Use String.charCodeAt() to extract numbers from the string.

Since JavaScript is dumb about things like special Unicode characters, such as combining accent characters and even invalid characters, length will work as you expect. That is, it will give you the number of "charCodes" not the number of Unicode characters.
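A rough sketch of that encoding and decoding (my own illustration):

// One UTF-16 code unit per id; each id must be in 0..65535.
function idsToString(ids) {
  return String.fromCharCode.apply(null, ids);
}

function stringToIds(str) {
  var ids = new Array(str.length);
  for (var i = 0; i < str.length; i++) {
    ids[i] = str.charCodeAt(i);
  }
  return ids;
}

var key = idsToString([1, 5, 74, 1013]);
key.length;         // 4 "charCodes", as noted above
stringToIds(key);   // [1, 5, 74, 1013]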

xan
  • 7,511
  • 2
  • 32
  • 45
0

You can use a bitset. There are efficient libraries implementing bitsets in JavaScript. See for example:

https://github.com/lemire/FastBitSet.js

Daniel Lemire
  • 3,470
  • 2
  • 25
  • 23
-1
  1. Find a more memory-efficient way to store those lists

Note that JavaScript does not have integers, only floating point numbers.

(Maybe as bits packed into big numbers?)

Be aware that JavaScript's bitwise operators coerce their operands to 32-bit integers, so a bit mask packed into a single number is limited to 32 bits.

Find some way to avoid duplicates.

That would save a lot of memory if you could 1) detect the dups and 2) point them to the same underlying object. This should be pretty trivial if you are constructing the structure ahead of time, but detecting dups at runtime will cost some performance; you'll have to benchmark to see how much. Off the top of my head, I'd say look into storing the arrays in a trie. Your objects keep a direct pointer to an array, but when adding a new array you go through the trie, which prevents dups.
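A bare-bones sketch of that trie-based dedup (it relies on the lists being sorted and unique, as the question states; all the names here are mine):

var trieRoot = {};

// Returns the one shared array for this list, creating it only the first time.
function getShared(list) {
  var node = trieRoot;
  for (var i = 0; i < list.length; i++) {
    node = node[list[i]] || (node[list[i]] = {});
  }
  if (!node.__array) node.__array = list.slice();   // the leaf stores the single copy
  return node.__array;
}

getShared([0, 1, 2, 3]) === getShared([0, 1, 2, 3]);   // true: both point to one array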

Do you have a suggestion?

If you're running in a browser, look into a project called asm.js. This will let you actually use ints.

BraveNewCurrency
  • 12,654
  • 2
  • 42
  • 50