6

I'm required to work on a serialization library in Java which must be as fast as possible. The idea is to create various methods which will serialize the specified value and its associated key and puts them in a byte buffer. Several objects that wrap this buffer must be created since the objects that need to be serialized are potentially alot.

Considerations: I know the Unsafe class may not be implemented in every JVM, but it's not a problem. Premature optimization: this library has to be fast and this serialization is the only thing it has to do. The objects once serialized are tipically small (less than 10k) but they are alot and they can be up to 2Gb big. The underlying buffer can be expanded / reduced but I'll skip implementation details, the method is similar to the one used in the ArrayList implementation.

To clarify my situation: I have various methods like

public void putByte(short key, byte value);
public void putInt(short key, int value);
public void putFloat(short key, float value);

... and so on...

these methods append the key and the value in a byte stream, so if i call putInt(-1, 1234567890) my buffer would look like: (the stream is big endian)

     key       the integer value  
[0xFF, 0xFF, 0x49, 0x96, 0x02, 0xD2]

In the end a method like toBytes() must be called to return a byte array which is a trimmed (if needed) version of the underlying buffer.

Now, my question is: what is the fastest way to do this in java?

I googled and stumbled upon various pages (some of these were on SO) and I also did some benchmarks (but i'm not really experienced in benchmarks and that's one of the reasons I'm asking for the help of more experienced programmers about this topic).

I came up with the following solutions:

1- The most immediate: a byte array

If I have to serialize an int it would look like this:

public void putInt(short key, int value)
{
    array[index]   = (byte)(key >> 8);
    array[index+1] = (byte) key;
    array[index+2] = (byte)(value >> 24);
    array[index+3] = (byte)(value >> 16);
    array[index+4] = (byte)(value >> 8);
    array[index+5] = (byte) value;
}

2- A ByteBuffer (be it direct or a byte array wrapper)

The putInt method would look like the following

public void putInt(short key, int value)
{
   byteBuff.put(key).put(value);
}

3- Allocation on native memory through Unsafe

Using the Unsafe class I would allocate the buffer on native memory and so the putInt would look like:

public void putInt(short key, int value)
{
  Unsafe.putShort(address, key);
  Unsafe.putInt(address+2, value);
}

4- allocation through new byte[], access through Unsafe

I saw this method in the lz4 compression library written in java. Basically once a byte array is instantiated i write bytes the following way:

public void putInt(short key, int value)
{
   Unsafe.putShort(byteArray, BYTE_ARRAY_OFFSET + 0, key);
   Unsafe.putInt(byteArray, BYTE_ARRAY_OFFSET + 2, value);
}

The methods here are simplified, but the basic idea is the one shown, I also have to implement the getter methods . Now, since i started to work in this i learnt the following things:

1- The JVM can remove array boundary checks if it's safe (in a for loop for example where the counter has to be less to the length of the array) 2- Crossing the JVM memory boundaries (reading/writing from/to native memory) has a cost. 3- Calling a native method may have a cost. 4- Unsafe putters and getters don't make boundary checks in native memory, nor on a regular array. 5- ByteBuffers wrap a byte array (non direct) or a plain native memory area (direct) so case 2 internally would look like case 1 or 3.

I run some benchmarks (but as I said I would like the opinion / experience of other developers) and it seems that case 4 is slightly (almost equals) to case 1 in reading and about 3 times faster in writing. It also seems that a for loop with Unsafe read and write (case 4) to copy an array to another (copying 8 bytes at time) is faster than System.arraycopy.

Long story made short (sorry for the long post):

case 1 seems to be fast, but that way I have to write a single byte each time + masking operations, which makes me think that maybe Unsafe, even if it's a call to native code may be faster.

case 2 is similar to case 1 and 3, so I could skip it (correct me if I'm missing something)

case 3 seems to be the slowest (at least from my benchmarks), also, I would need to copy from a native memory to a byte array because that's must be the output. But here this programmer claims it's the fastest way by far. If I understood correctly, what am I missing?

case 4 (as supported here) seems to be the fastest.

The number of choices and some contradictory information confuse me a bit, so can anyone clarify me these doubts?

I hope I wrote every needed information, otherwise just ask for clarifications.

Thanks in advance.

Community
  • 1
  • 1
MastErAldo
  • 634
  • 3
  • 12
  • 29
  • Any particular reason for not using an existing serialization library, such as [Kryo](https://github.com/EsotericSoftware/kryo)? – thkala Sep 06 '14 at 22:17
  • Thanks for pointing that out. However I'd like not to rely on an external library since the serialization protocol is primarily for primitive data types and already serialized ones (byte, short integer arrays...). It may be an overkill to use Kryo. And another point is that luckily I'm free to experiment, so even if maybe I'm reinventing the wheel, it would be at least a usefull exercise to me. – MastErAldo Sep 06 '14 at 22:39
  • 1
    What is the target medium for your serialized data? Serializing to a byte buffer is not very useful on its own. Unless you are doing something *really* interesting such as RDMA, serializing to most targets (network, disk etc) is I/O-bound. And if you are just doing IPC locally, I'd first look in the possibility of refactoring my application to avoid serialization/deserialization completely. Can you provide some details on what you are doing, and what necessitates such performance concerns? – thkala Sep 06 '14 at 22:54
  • It's a general purpose library to serialize primitives and primitive arrays, yes, it's mainly for network/disk but also IPC. The library serializes this way (I'll take an integer as an example): [x,x,x,x, y,y, w,w, z, d,d,d,d, e] x is the length of the serialized stream, y the message ID, w the integer key, z the type ID, d the integer, e the end of stream marker,adding other values means to add more w,z and d. The point is since i can experiment and choosing one of the 4 cases listed has more or less the same complexity,I would like to learn a bit more if anyone has had similar experiences – MastErAldo Sep 06 '14 at 23:09
  • Case 4 should be the fastest. I suspect your bechmark is wrong. [one-nio](https://github.com/odnoklassniki/one-nio) library uses the similar approach. And yes, it is noticeably faster than Kryo. – apangin Sep 07 '14 at 04:08
  • 1
    If you're not experiences with benchmarking, be aware that it's pretty hard. By all means, use [JMH](http://openjdk.java.net/projects/code-tools/jmh) or [Caliper](http://code.google.com/p/caliper), otherwise your results are nearly sure plain garbage. – maaartinus Sep 07 '14 at 05:06
  • Sorry, I just realized that in the final considerations I inverted case 3 with 4. I edited my question to correct it, so I agree that case 4 should be the fastest, when I have the time i'd like to post the benchmark I made to see if it's meaningful. Thanks to maaartinus for having notified JMH and Caliper, I'll surely give them a try next time I have to benchmark something – MastErAldo Sep 07 '14 at 13:18
  • FWIW, I think you'll get more performance benefit from carefully managing the underlying byte[] (i.e. pre-allocating a single array and re-using it, don't do defensive copying) than you'll see between case 1 and case 4. Also, I just did a good bit of reading, and it sounds like the performance between case 1 and 4 are highly dependent on your specific tests (case 1 sometimes faster, case 4 sometimes faster). – Kevin Day Sep 07 '14 at 14:55
  • As said I'd like to post what is a slightly modified version of the benchmark I did. As maaartinus points out it may be garbage, but maybe it may be of some kind of interest, at least for pointing out what I did wrong. http://pastie.org/9534291 – MastErAldo Sep 07 '14 at 18:54
  • 1
    I listed [here](http://stackoverflow.com/questions/24882946/java-loop-gets-slower-after-some-runs-jits-fault/24889503#24889503) several reasons why a benchmark may produce misleading results. Your benchmark seems to follow similar antipatterns. – apangin Sep 07 '14 at 20:05
  • Thanks for the link, I'll try to make a better one following your advices and using Caliper – MastErAldo Sep 07 '14 at 20:26

1 Answers1

1

Case 5: DataOutputStream writing to a ByteArrayOutputStream.

Pro: it's already done; it's as fast as anything else you've mentioned here; all primitives are already implemented. The converse is DataInputStream reading from a ByteArrayInputStream.

Con: nothing I can think of.

user207421
  • 305,947
  • 44
  • 307
  • 483
  • I saw the sources of ByteArrayOutputStream and it basically serializes the data in the same way I do manually in case 1, so it would a nice advantage to avoid the bit shifting stuff. However even with Unsafe I have methods which can automatically put an int/float/... directly in an array and avoiding the stream objects instantiation. Any particular reason why I should avoid cases 3 / 4 apart the fact that using the Unsafe class may be, well, unsafe? – MastErAldo Sep 07 '14 at 01:09
  • Is there any reason why you *shouldn't* avoid them, when this solution is already written, documented, supported, non-deprecated, tested, ...? – user207421 Sep 07 '14 at 01:23
  • 2
    @EJP normally I'd agree with you, *but* in this case, performance is a big consideration. Performance issues with BAOS: 1. the user would need to allocate a new BAOS, which would create a new byte[]. Re-using an existing byte[] will be much more efficient. 2. BAOS.toByteArray() is synchronized - if locking isn't an issue, that's just overhead. 3. BAOS.toByteArray() actually copies the results into a new byte[] - so we wind up with *two* byte[] creations, plus an array copy. – Kevin Day Sep 07 '14 at 14:39
  • 1
    @KevinDay You're speculating. I suggest you benchmark and measure. – user207421 Sep 08 '14 at 01:24