
Serialization and deserialization have been a preferred and accepted method of deep copying objects with complicated graphs (How do you make a deep copy of an object in Java? etc.) where the copy constructor/factory method approach is not well suited.
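Written out, the idiom looks roughly like this (a minimal sketch, not the asker's actual code; every class reachable from the copied object must implement Serializable):

```java
import java.io.*;

// Deep copy via serialization: write the object graph to an in-memory
// buffer, then read a completely independent copy back out of it.
final class DeepCopy {

    @SuppressWarnings("unchecked")
    static <T extends Serializable> T copy(T original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(original);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (T) ois.readObject();
        }
    }
}
```

A copy obtained this way shares no state with the original, which is why the idiom is popular for complicated object graphs.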

However, this approach does not work with maps that specified an initial capacity. My previous question on this (Why does specifying Map's initial capacity cause subsequent serializations to give different results?) shows that the resulting objects are not equal, and the answer shows that their byte representation is different: if serializing gives byte[] b1, then deserializing and serializing again will give byte[] b2 which is different from b1 (at least 1 element is different). This is in contrast to the usual behavior of de/serializing objects.

The readObject and writeObject methods that control the de/serialization process are private and so cannot be overridden - this was probably done on purpose (Hashmap slower after deserialization - Why?).

I am using the de/serialization deep copy approach on an object holding many other objects, including maps. I am also performing comparison and changes on their byte array representation. As long as the maps are not initialized with the initial capacity parameter, all works well. However, attempting to optimize the maps by specifying the initial capacity breaks this approach as mentioned above.
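The question does not show the actual diff scheme, so here is a hypothetical sketch of the idea: an XOR-based diff, which assumes both serialized arrays have the same length.

```java
// Hypothetical sketch of a byte-level diff between two serialized states;
// the real scheme used by the asker is not shown in the question.
// XOR is chosen because applying the diff again reverses it; this
// assumes both arrays have the same length.
final class ByteDiff {

    static byte[] diff(byte[] b1, byte[] b2) {
        byte[] d = new byte[b1.length];
        for (int i = 0; i < b1.length; i++) {
            d[i] = (byte) (b1[i] ^ b2[i]);
        }
        return d;
    }

    // b2 == apply(b1, diff(b1, b2)): XOR is its own inverse.
    static byte[] apply(byte[] b1, byte[] d) {
        return diff(b1, d);
    }
}
```

Any such scheme is only as reliable as the guarantee that the two byte arrays line up byte for byte, which is the crux of the problem above.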

I would like to know if it's possible to circumvent this issue and if so how.

user1803551
    Why does it "break approach"? What is "broken" about having a different capacity? Don't say iteration order, because iteration order of `HashMap` is *specifically* defined to be arbitrary. – Andreas Aug 28 '16 at 16:39
  • I have the same question as Andreas. Can you clarify what issues you're seeing with this approach? And actually, a small representative example *would* really help here. – Jason C Aug 28 '16 at 16:44
  • Also, just as a quick suggestion, you *do* have the opportunity to e.g. rebuild maps and such in [`readResolve()`](https://docs.oracle.com/javase/7/docs/platform/serialization/spec/input.html#5903) post-deserialization. So assuming it's truly broken and it's not another issue you are having, and precluding a different approach entirely, you could potentially reconstruct your maps with appropriate parameters in `readResolve()`. – Jason C Aug 28 '16 at 16:46
  • If speed is not a major concern (which is probably the case if you're using serialization to copy), you can create an additional map with a specified initial capacity and manually transfer the entries from the deserialized map. – shmosel Aug 28 '16 at 16:58
  • Alternatively, you can skip deserialization and iterate over the source map to copy each key and value to a new map with a specified initial capacity. – shmosel Aug 28 '16 at 17:02
  • @Andreas See my edited question. A copy-paste-ready MCVE is given in my previous (linked) question. – user1803551 Aug 28 '16 at 17:53
  • @JasonC See my comment above to Andreas. I will look at `readResolve()`. Thanks. – user1803551 Aug 28 '16 at 17:53
  • @shmosel While speed is not a concern, that doubles the memory requirement of each map, and memory could be a concern. As for copying the map in its byte array form, I would need to know exactly which bytes to copy, and I don't know what guarantees there are for finding their position. As my edited question says, the byte array itself is different, so I risk corrupted copies, which was my initial problem. – user1803551 Aug 28 '16 at 18:01
  • @user1803551 Yeah but it's still not clear what your actual *issue* is. While it's true that `State` serializes to different byte arrays when an initial capacity is set, it serves no purpose to care that the serialized data is different. The only thing that actually matters in your application is that the deep copy is done properly. I don't actually see any functional problem with your observations here other than that the serialized data is different for different initial capacities, but you have zero reason to care about the precise serialized bytes, only the deserialization result. – Jason C Aug 28 '16 at 18:02
  • @user1803551 It doesn't double the memory requirements because the deserialized copy will get GC'd after you're done transferring. And my second suggestion was to copy the map's elements manually, without serialization altogether. – shmosel Aug 28 '16 at 18:05
  • On the other hand if your deserialized object ended up with, say, *incorrect items* in the map, that would be a real problem. But that doesn't seem to be happening. Your deserialized results are still a correct deep copy. So if your only actual problem is different long term performance characteristics due to a deserialized map having a different initial capacity than its original counterpart, then copy everything to a map with the correct initial capacity in `readResolve`. Aside from that there aren't really any other problems here. Different byte arrays do not imply a risk of corrupted copies. – Jason C Aug 28 '16 at 18:05
  • @JasonC I am transitioning between states by changing their bytes: (1) Ser. `state1` into `byte[] b1`. (2) Ser. `state2` into `byte[] b2`. (3) Save a `byte[] dif` with the difference between them and discard `state2` and `b2`. (4) At some point recreate `state2` by changing `b1` using `dif`. If the states contain an initial capacity map, an exception is thrown (I think it's "invalid stream header"). Do you want me to post the code that does it or is the reasoning enough? – user1803551 Aug 28 '16 at 18:16
  • @JasonC Though honestly, if there is a way to keep differences between objects in a way that allows rebuilding one from another without byte arrays, I'm open to suggestions. I keep an initial state and a list of differences that allow me to move from one state to its adjacent states. – user1803551 Aug 28 '16 at 18:18
  • @shmosel The double memory requirement will be during copying, yes. Copying the elements without de/serialization is basically a copy constructor/factory method approach. – user1803551 Aug 28 '16 at 18:20
  • @user1803551 That's essentially correct. But elements in the map could deep copy using the standard serialization procedure. Basically I had in mind something like this (pseudocode): `clone(obj) { if obj is map { create new map; foreach entry in obj map.put(clone(key), clone(value)); return map; } else return deserialize(serialize(obj)); }` – shmosel Aug 28 '16 at 18:34
  • And about the double memory, that only applies to the actual map structure. The *data* will only be a shallow copy of the deserialized version. – shmosel Aug 28 '16 at 18:37
  • @user1803551 If the exception is `StreamCorruptedException` with a message "invalid stream header" that means the bytes being deserialized really have been corrupted. In your example you serialize to `b1` and `b2`; presumably these deserialize successfully. If applying `dif` to `b1` doesn't reconstruct `b2` exactly, then that implies to me there's a bug in the difference algorithm -- either in its creation or its application to `b1`. – Stuart Marks Aug 29 '16 at 15:22

1 Answer

Ok, so, first of all, you're barking up the wrong tree by focusing on the fact that specifying an initial capacity leads to different serialized bytes. In fact, if you look at the difference:

pbA from your example:
: ac ed 00 05 73 72 00 0f 71 33 39 31 39 33 34 39   ....sr..q3919349
: 34 2e 53 74 61 74 65 00 00 00 00 00 00 00 01 02   4.State.........
: 00 01 4c 00 04 6d 61 70 73 74 00 10 4c 6a 61 76   ..L..mapst..Ljav
: 61 2f 75 74 69 6c 2f 4c 69 73 74 3b 78 70 73 72   a/util/List;xpsr
: 00 13 6a 61 76 61 2e 75 74 69 6c 2e 41 72 72 61   ..java.util.Arra
: 79 4c 69 73 74 78 81 d2 1d 99 c7 61 9d 03 00 01   yListx.....a....
: 49 00 04 73 69 7a 65 78 70 00 00 00 01 77 04 00   I..sizexp....w..
: 00 00 01 73 72 00 14 71 33 39 31 39 33 34 39 34   ...sr..q39193494
: 2e 4d 61 70 57 72 61 70 70 65 72 00 00 00 00 00   .MapWrapper.....
: 00 00 01 02 00 01 4c 00 03 6d 61 70 74 00 0f 4c   ......L..mapt..L
: 6a 61 76 61 2f 75 74 69 6c 2f 4d 61 70 3b 78 70   java/util/Map;xp
: 73 72 00 11 6a 61 76 61 2e 75 74 69 6c 2e 48 61   sr..java.util.Ha
: 73 68 4d 61 70 05 07 da c1 c3 16 60 d1 03 00 02   shMap......`....
: 46 00 0a 6c 6f 61 64 46 61 63 74 6f 72 49 00 09   F..loadFactorI..
: 74 68 72 65 73 68 6f 6c 64 78 70 3f 40 00 00 00   thresholdxp?@...
: 00 00 02 77 08 00 00 00 02 00 00 00 00 78 78      ...w.........xx 

zero from your example:
: ac ed 00 05 73 72 00 0f 71 33 39 31 39 33 34 39   ....sr..q3919349
: 34 2e 53 74 61 74 65 00 00 00 00 00 00 00 01 02   4.State.........
: 00 01 4c 00 04 6d 61 70 73 74 00 10 4c 6a 61 76   ..L..mapst..Ljav
: 61 2f 75 74 69 6c 2f 4c 69 73 74 3b 78 70 73 72   a/util/List;xpsr
: 00 13 6a 61 76 61 2e 75 74 69 6c 2e 41 72 72 61   ..java.util.Arra
: 79 4c 69 73 74 78 81 d2 1d 99 c7 61 9d 03 00 01   yListx.....a....
: 49 00 04 73 69 7a 65 78 70 00 00 00 01 77 04 00   I..sizexp....w..
: 00 00 01 73 72 00 14 71 33 39 31 39 33 34 39 34   ...sr..q39193494
: 2e 4d 61 70 57 72 61 70 70 65 72 00 00 00 00 00   .MapWrapper.....
: 00 00 01 02 00 01 4c 00 03 6d 61 70 74 00 0f 4c   ......L..mapt..L
: 6a 61 76 61 2f 75 74 69 6c 2f 4d 61 70 3b 78 70   java/util/Map;xp
: 73 72 00 11 6a 61 76 61 2e 75 74 69 6c 2e 48 61   sr..java.util.Ha
: 73 68 4d 61 70 05 07 da c1 c3 16 60 d1 03 00 02   shMap......`....
: 46 00 0a 6c 6f 61 64 46 61 63 74 6f 72 49 00 09   F..loadFactorI..
: 74 68 72 65 73 68 6f 6c 64 78 70 3f 40 00 00 00   thresholdxp?@...
: 00 00 00 77 08 00 00 00 01 00 00 00 00 78 78      ...w.........xx 

The only difference is the couple of bytes that encode the HashMap's capacity-dependent state (its threshold and bucket count). Of course those bytes differ when you specify a different initial capacity - that is precisely what they record. This is a red herring.

You are concerned about a corrupt deep copy, but this concern is misplaced. The only thing that matters, in terms of correctness, is the result of the deserialization: it just needs to be a correct, fully functional deep copy that doesn't violate any of your program's invariants. Focusing on the precise serialized bytes is a distraction; all you care about is that the deserialized result is correct.

Which brings us to the next point:

The only real issue you face here is a difference in long term performance (both speed and memory) characteristics from the fact that some Java versions ignore the initial map capacity when deserializing. This does not affect your data (that is, it will not break invariants), it only potentially affects performance.

So your very first step is to ensure that this is actually a problem. That is, it boils down to a potential premature optimization issue: Ignore the difference in the deserialized map's initial capacity for now. If your application runs with sufficient performance characteristics then you have nothing else to worry about. If it doesn't, and if you are able to narrow the bottlenecks down to decreased deserialized hash map performance due to a different initial capacity, only then should you approach this problem.

And so, the final part of this answer is, if you determine that the performance characteristics of the deserialized map actually are insufficient, there are a number of things you can do.

The simplest, most obvious one I can think of is to implement readResolve() on your object, and take that opportunity to:

  1. Construct a new map with the appropriate parameters (initial capacity, etc.)
  2. Copy all of the items from the old deserialized map to the new one.
  3. Then discard the old map and replace it with your new one.

Example (from your original code example, choosing the map that yielded the "false" result):

class MapWrapper implements Serializable {

    private static final long serialVersionUID = 1L;

    Map<String, Integer> map = new HashMap<>(2);

    private Object readResolve () throws ObjectStreamException {
        // Replace deserialized 'map' with one that has the desired
        // capacity parameters.
        Map<String, Integer> fixedMap = new HashMap<>(2);
        fixedMap.putAll(map);
        map = fixedMap;
        return this;
    }

}
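To sanity-check that readResolve() actually runs, you can round-trip the wrapper through an in-memory buffer and compare the maps. The class is repeated here only so the snippet is self-contained; the copy helper is a stand-in for whatever serialization plumbing you already have:

```java
import java.io.*;
import java.util.*;

class MapWrapper implements Serializable {

    private static final long serialVersionUID = 1L;

    Map<String, Integer> map = new HashMap<>(2);

    private Object readResolve() throws ObjectStreamException {
        // Replace the deserialized 'map' with one that has the desired
        // capacity parameters.
        Map<String, Integer> fixedMap = new HashMap<>(2);
        fixedMap.putAll(map);
        map = fixedMap;
        return this;
    }
}

class RoundTrip {

    // Serialize to an in-memory buffer, then deserialize a fresh copy;
    // readResolve() is invoked automatically during readObject().
    static MapWrapper copy(MapWrapper w)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(w);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (MapWrapper) ois.readObject();
        }
    }
}
```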

But first, question whether this is really causing issues for you. I believe you are overthinking it, and that hyper-focusing on a byte-for-byte comparison of the serialized data is not productive.

Jason C