3

I have many arrays of longs (long[]) that I need to serialize and save to disk for later reading. Each array has to be modified from time to time, but writes are infrequent while reads are frequent. Usually my application needs only a small number of these arrays loaded into memory at the same time. Edits to an array can happen in a batch in memory before the array is stored back to disk. Each array has from a few hundred up to a million elements. In my application it is critical that loading the desired arrays into memory happens very fast.

In my case the long values in each array are, on average, quite close to one another, i.e., if the values within a single array are sorted, the difference between one value and the next usually fits in an int.

The solution adopting a trie-like structure presented here does not seem applicable to my case, since in that solution the array values are known in advance and never change.

This solution here suggests using ByteBuffer and LongBuffer to speed up I/O, but my idea is to also store the arrays in the most compact way possible, so that the time needed to load them into main memory is reduced by shrinking the amount of data to read. The intuition is to store the values sorted and to store the difference between each value and the next, which on average fits in the Integer range and therefore takes less space; for example, 1_000_000_000_000 followed by 1_000_000_000_500 could be stored as the first value plus a delta of 500. Since this is not always true, however, I still cannot store all the values as ints, so this direction does not seem promising on its own. Am I missing something obvious?

What is the most efficient way, in I/O time, to achieve this?


EDIT: In general, if performance means solely I/O time, without considering space on disk, this question has better answers.

Kuzeko
  • How much in-memory processing do you do? You might, from left field, consider a `MappedByteBuffer` mapped to a `LongBuffer`, which eliminates the I/O altogether except for the elements you actually access (a sketch follows these comments). – user207421 Mar 04 '16 at 23:20
  • @EJP I do not understand your question and your suggestion. Can you clarify it? Thanks. – Kuzeko Mar 07 '16 at 15:18
  • My question was 'how much in-memory processing do you do?'. What part of that don't you understand? My suggestion was to use a virtual array of longs that is backed directly by a disk file. This will work or not work acceptably fast depending on how much computation you do with the data. – user207421 Mar 08 '16 at 02:20
  • Those arrays are accessed and iterated through some hundreds of thousands of times within a second or two. – Kuzeko Mar 08 '16 at 10:13
  • NOTE: @apangin's solution is correct; nonetheless, under 10M nodes, normal serialization and deserialization is much faster than this method. – Kuzeko Mar 16 '16 at 14:09
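
A minimal sketch of the memory-mapping idea from the comment above, assuming the data file simply contains raw big-endian longs; the file name and the summing loop are illustrative, not something given in the question:

    import java.io.IOException;
    import java.nio.LongBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class MappedLongArrayExample {
        public static void main(String[] args) throws IOException {
            Path file = Paths.get("longs.bin");   // hypothetical file of raw longs
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // Map the file into memory; pages are loaded lazily, only when accessed.
                MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                LongBuffer longs = mapped.asLongBuffer();
                long sum = 0;
                for (int i = 0; i < longs.limit(); i++) {
                    sum += longs.get(i);          // reads straight from the mapped pages
                }
                System.out.println("sum = " + sum);
            }
        }
    }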

2 Answers

2

You seem to be placing a lot of emphasis on compactness and speed. To get these to a true minimum level is going to take a lot of optimization. And by a lot, I mean more than your typical developer ever has to deal with.

Instead of doing this yourself, you would be wise to research existing database solutions. The developers of these databases dedicate years to understanding the most efficient way to perform these operations, and the overhead is much lower than you might think. Not to mention the correctness and reliability you get for free.

I would use a stock database solution (just whip out a mysql, maria, or postgres instance and send it to town) and see if that meets your performance metrics. If it doesn't, find what specific metrics it doesn't meet and tune it to those. The things you're asking for require specialized knowledge of your data and the ability to experiment with it, something no one over the internet can do (or would be expected to do for free.)

corsiKa
  • This seems to have attracted a down vote. I don't particularly mind, but I would be interested to know what about this answer could be improved! – corsiKa Mar 05 '16 at 19:33
  • 2
    It's my vote, sorry. You suggest looking at one of the existing databases, but I don't see how this answers the question. Are you aware of a DBMS good at compacting int64 series? Why do you think relational databases may help here, when the described scenario is more about key-value usage? Why do you imply that the author hasn't already researched the possibility of using a DBMS? The given suggestions are fine as a comment, but not constructive enough as an answer. – apangin Mar 06 '16 at 01:16
  • 1
    @apangin Nothing in the question indicates OP has looked into existing database solutions and why the most common ones don't meet the criteria, so it's very likely OP hadn't considered it. As far as 'compacting' goes, the truth of it is that until OP has a solution that works functionally and can run benchmarks, optimizing for things like disk usage and speed is worthless. As far as relational vs nosql dbs, my experience with them has been that they deal with text records better than binary, but by all means, try out a Mongo instance. But use an existing solution instead of rolling your own first. – corsiKa Mar 06 '16 at 02:58
  • I asked explicitly to do it in *vanilla* Java. DBMSs are too slow for the kind of I/O processing I need (I have already tried them in many cases) and are a burden for my application. I do not need indexing capabilities, I do not need query processing, and a DBMS will not be compact enough. – Kuzeko Mar 07 '16 at 15:20
  • 1
    @Kuzeko I've run into this exact problem before: we had to load hundreds of millions of polygons from disk in a Beowulf cluster. We solved it by dedicating one of the nodes to being an off-the-shelf mysql database. We only had 15 nodes instead of 16, but the simulations went over an order of magnitude faster. It was just better at handling data than any of the engineers could do by hand. And that was in C++, which in my experience is much faster with I/O than Java is. What is your use-case? What does the data represent? That could offer optimization clues if it's *that* important. – corsiKa Mar 07 '16 at 15:30
  • Also what is the total size of the payload? Like, if this was all written to disk, what's the size of that folder (before being zipped...) – corsiKa Mar 07 '16 at 15:31
2

You may still encode array elements as ints with the following addition:

    // The first int is the array length
    buf.putInt(array.length);

    long prev = 0;
    for (long next : array) {
        if (next > prev && next - prev <= Integer.MAX_VALUE) {
            // Delta fits in 31 bits. Change the sign and encode it as a strictly negative int.
            buf.putInt((int) (prev - next));
        } else {
            // Delta is zero or does not fit in 31 bits. Encode the two halves of the long;
            // the higher half is non-negative because the values are positive 63-bit longs.
            buf.putInt((int) (next >>> 32));
            buf.putInt((int) next);
        }
        prev = next;
    }

Note that a 31-bit delta is encoded as a strictly negative int, while the higher half of a raw long is always non-negative, so during decoding the sign of the first int tells whether the value is a delta or a raw 63-bit long. In the latter case you read the next int and compose the 63-bit long from the two ints:

    // The first int is the array length
    long[] array = new long[buf.getInt()];

    long next = 0;
    for (int i = 0; i < array.length; i++) {
        int delta = buf.getInt();
        if (delta < 0) {
            // Negative sign means the value is encoded as an int delta.
            next -= delta;
        } else {
            // A non-negative first int means the value is encoded as a raw long.
            // Read the second (lower) part of the long and combine it with the higher part.
            next = (long) delta << 32 | (buf.getInt() & 0xffffffffL);
        }
        array[i] = next;
    }

This works if all values in the array are positive. If there are both positive and negative values, split them into two arrays.
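
For completeness, here is a minimal sketch of how the two loops above could be packaged into a save/load round trip with a `FileChannel`; the class and method names (`LongArrayCodec`, `save`, `load`) and the worst-case buffer sizing are illustrative choices, not part of the original answer:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical helper that applies the encoding above and persists the result.
    final class LongArrayCodec {

        static void save(Path file, long[] array) throws IOException {
            // Worst case: 4 bytes for the length plus 8 bytes per element.
            ByteBuffer buf = ByteBuffer.allocate(4 + 8 * array.length);
            buf.putInt(array.length);
            long prev = 0;
            for (long next : array) {
                if (next > prev && next - prev <= Integer.MAX_VALUE) {
                    buf.putInt((int) (prev - next));   // delta as a negative int
                } else {
                    buf.putInt((int) (next >>> 32));   // raw long, higher half first
                    buf.putInt((int) next);
                }
                prev = next;
            }
            buf.flip();
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            }
        }

        static long[] load(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
                while (buf.hasRemaining() && ch.read(buf) != -1) { /* fill the buffer */ }
                buf.flip();
                long[] array = new long[buf.getInt()];
                long next = 0;
                for (int i = 0; i < array.length; i++) {
                    int delta = buf.getInt();
                    next = delta < 0 ? next - delta
                                     : (long) delta << 32 | (buf.getInt() & 0xffffffffL);
                    array[i] = next;
                }
                return array;
            }
        }
    }

The encoded file is at most 4 + 8 * array.length bytes, and close to 4 + 4 * array.length when most deltas fit in 31 bits, so the amount of data read at load time roughly halves in the common case.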


By the way, a streaming compression like GZIP (or a faster alternative like LZ4) will also work well if the neighbour values are close. See GZIPOutputStream.
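
A minimal sketch of that alternative, with illustrative class and file names, writing the longs through a `DataOutputStream` wrapped in a `GZIPOutputStream` and reading them back symmetrically:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    class GzipLongArrayIo {

        static void save(File file, long[] array) throws IOException {
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file))))) {
                out.writeInt(array.length);
                for (long value : array) {
                    out.writeLong(value);   // GZIP exploits the similarity of neighbour values
                }
            }
        }

        static long[] load(File file) throws IOException {
            try (DataInputStream in = new DataInputStream(new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(file))))) {
                long[] array = new long[in.readInt()];
                for (int i = 0; i < array.length; i++) {
                    array[i] = in.readLong();
                }
                return array;
            }
        }
    }

A faster codec such as LZ4 is not part of the JDK, so it would need a third-party library, but the stream wrapping is analogous.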

apangin
  • Using negatives was also an idea that came to my mind; how would you read them back then? Could you give an example using the compression you suggest: which class/methods should I use? – Kuzeko Mar 07 '16 at 16:14
  • @Kuzeko I've updated the answer with the example of coding/decoding logic. Also added the reference to GZIPOutputStream for the alternative approach. – apangin Mar 08 '16 at 01:50