22

I know of at least two byte-code enhancers that modify the "object model" at runtime to allow transactions to be performed transparently. One of them is part of Versant VOD, which I use at work every day, and the other is part of Terracotta. There are probably quite a few others, for example in ORM, but Versant takes care of that at my company.

My question is: is there such an open-source API that can be used on its own, independent of the product it was designed for? You could call it a "hackable" API. It should only track changes, not read access, since tracking reads would slow down the code significantly. In other words, it should not require explicit read/write locking. This requires either access to all classes that perform changes, not just to the data model, or keeping some form of "previous version" in memory to compare against.

The problem that I'm trying to solve is that I have "large" (32K to 256K) object graphs that are "serialized" in a (NoSQL) DB. They are long-lived and must be re-serialized regularly to keep a "history" of the changes. But they are rather expensive to serialize, and most changes are minor.

I could serialize them fully each time and run a binary diff on the stream, but that sounds very CPU intensive. A better solution would be an API that modifies write operations on the model to protocol the changes, so that after the initial "image" is stored, only the protocol needs to be stored.

I've found some questions talking about Apache Commons Beanutils to compare objects, but that is not useful for in-place changes; I would need to make a complete clone of the model between every "business transaction".

To reiterate, I'm looking for an "in-memory" API, within the same JVM, which does not involve any external server application. APIs involving native code are OK if they are available on Win, Mac & Linux. The API does not have to be currently packaged independently; it just has to be possible to extract it from the "parent project" to form an independent API (the parent project license must allow this).

My object graphs will involve many large arrays, and so that needs to be supported efficiently.

The changes are not desired only for auditing, but so that they can be replayed, or undone. More precisely, with the deserialized initial graph, and a list of changes, I should arrive at an identical end graph. Also, starting with the end graph, it should be possible to go back to the initial graph by applying the changes in reverse. This uses exactly the same functionality, but requires the change protocol to keep the old value in addition to the new value.
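To make the replay/undo requirement concrete, here is a minimal sketch of such a change protocol. It is not taken from any existing library: the names `Change`, `TrackedModel` and `ChangeReplay` are hypothetical, and the model is simplified to a flat map from a path string to a value, which is an assumption made only to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** One recorded write: which slot changed, and its value before and after. */
final class Change {
    final String path;      // hypothetical addressing scheme, e.g. "order.total"
    final Object oldValue;  // needed to undo the change
    final Object newValue;  // needed to replay the change

    Change(String path, Object oldValue, Object newValue) {
        this.path = path;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }
}

/** A flat key/value stand-in for the model that logs every effective write. */
final class TrackedModel {
    private final Map<String, Object> state = new HashMap<>();
    private final List<Change> log = new ArrayList<>();

    /** Writes a value; null removes the slot. No-op writes are not logged. */
    void set(String path, Object value) {
        Object old = (value == null) ? state.remove(path) : state.put(path, value);
        if (!Objects.equals(old, value)) {
            log.add(new Change(path, old, value));
        }
    }

    Map<String, Object> snapshot() { return new HashMap<>(state); }

    List<Change> log() { return new ArrayList<>(log); }
}

final class ChangeReplay {
    /** Applies the changes in order to a copy of the initial image. */
    static Map<String, Object> replay(Map<String, Object> initial, List<Change> changes) {
        Map<String, Object> result = new HashMap<>(initial);
        for (Change c : changes) {
            put(result, c.path, c.newValue);
        }
        return result;
    }

    /** Applies the changes in reverse order to a copy of the end image. */
    static Map<String, Object> undo(Map<String, Object> end, List<Change> changes) {
        Map<String, Object> result = new HashMap<>(end);
        for (int i = changes.size() - 1; i >= 0; i--) {
            put(result, changes.get(i).path, changes.get(i).oldValue);
        }
        return result;
    }

    // A null value stands for "slot does not exist".
    private static void put(Map<String, Object> target, String path, Object value) {
        if (value == null) {
            target.remove(path);
        } else {
            target.put(path, value);
        }
    }
}
```

Replay and undo share the same loop; only the direction and the choice between `newValue` and `oldValue` differ, which is why the protocol has to store both.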

The API license should be compatible with commercial use.

[EDIT] So far I have not received a useful answer, and it does not seem like what I want exists. That leaves me with only one option: make it happen. I'll post a link here as an answer when I have a working implementation, as this is the next step in my project and I cannot go forward without it.

[EDIT] I found by accident this somewhat related question: Is there a Java library that can "diff" two Objects?

Sebastien Diot
  • Not totally sure what you're after. But it sounds analogous to Subversion. If the objects were serialized to a file that was checked into SVN, then the change history would be there. Not very helpful if you're after programmatic access to the changes. What do you need? – scorpdaddy May 03 '12 at 20:49
  • LOL! I see what you mean, but that doesn't fit my use case at all. Firstly, it should all be "in-memory", not client-server based, and secondly, all RCS are inherently bad at binary data, which is what I am trying to version. No XML or JSON involved here. I'll update the question. – Sebastien Diot May 04 '12 at 10:37
  • I don't know of any such API, but maybe you could think of other approaches. Instead of trying the transparent-API approach, it sounds like this is actually a requirement for your application. You may store the initial graph once, and then build a tree with the versions and the new data. – Daniel H. May 04 '12 at 11:28
  • It does not have to be transparent. My problem is that I want to also use "third-party" beans, and I could not rely on them to write the "change tracking" code manually without errors. A code-generator (bean-interface to change-tracking-bean) could also be acceptable, depending on its limitations. – Sebastien Diot May 04 '12 at 12:44
  • if you switch your serialization over to something like BSON (http://bsonspec.org/) it would be easier to isolate the "diff" to specific attributes (since the binary for the rest of the attributes would be the same) – radai May 06 '12 at 11:06
  • @radai BSON looks alright, although it's missing int16, of which I will probably have large arrays. But how would I create a diff? Using a binary diff impl on the complete serialized graph? That sounds expensive. – Sebastien Diot May 06 '12 at 13:00
  • probably would be. you might be able to "diff" the object with a previous copy of it while serializing, but it's "O(n)" either way unless you use something like ASM/javassist to dynamically subclass all of your domain classes and override all the setters to mark "dirty" fields. but even if you mark dirty fields you might miss out on things like: someDomainObject.getSomeInternalClass().nonSTandardMethodThatCHangesState(). comparing the serialized form is the only "bulletproof" way (especially if you have no control over 3rd parties). using BSON just allows you to narrow down diffs to fields – radai May 06 '12 at 13:28
  • What's wrong with Terracotta? Or for that matter with VmWare Gemfire (except for being enterprise) – Stephan Eggermont May 10 '12 at 16:46
  • @radai I also think that serializing the whole graph is safer than modifying the bytecode to keep track of "dirty fields". I think I got you wrong before; you meant creating a diff at the BSON API level, not using a generic binary diff, right? That sounds reasonable. – Sebastien Diot May 10 '12 at 17:35
  • @StephanEggermont I never heard of Gemfire; going to have a look at it. Unfortunately, I liked Terracotta but it hates me. *Twice* I tried to replace a home-grown comm API with it: it works perfectly on the test servers, but fails as soon as you put it on the production server. That meant cancelling two maintenance releases. There is no way I'm going to let it make a fool of me a third time. Also, it's no good with untrusted clients across the Internet; you have to trust your clients, and they have to work synchronously over a low-latency network due to the global locking. – Sebastien Diot May 10 '12 at 17:39
  • Fail as in hardware differences or as in performance under load? – Stephan Eggermont May 10 '12 at 19:11
  • @StephanEggermont Fails as in the client JVM gets stuck at start because the TC agent cannot connect to the TC server. Test servers and production server were configured the same way, without a firewall in between. We never found out why. But this is off-topic and should not be discussed here. – Sebastien Diot May 11 '12 at 08:08

4 Answers

8

Kryo v1 had a serializer that knows about the last data that was serialized and only emits a delta. When reading, it knows about the last data received and applies the delta. The delta is done at the byte level. Here is the serializer. Most of the work is done by this class. This could be used in a few useful ways, e.g. networking similar to Quake 3.

This was omitted in Kryo v2 because AFAIK it had never been put to use. Also, it did not have an extensive set of tests. It could be ported though and may do what you need, or serve as the basis for what you need.

Above also posted on JVM serializers mailing list.
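For illustration only, here is a minimal sketch of what a byte-level delta against the previously serialized data can look like. This is not the Kryo v1 algorithm; it is a much simpler stand-in that only trims the common prefix and suffix of the two byte arrays. The class name `PrefixSuffixDelta` and the delta layout are assumptions made for the example.

```java
import java.nio.ByteBuffer;

/** Byte-level delta that stores only the bytes between the common prefix and suffix. */
final class PrefixSuffixDelta {

    /** Encodes next relative to prev as: prefixLen, suffixLen, middleLen, middle bytes. */
    static byte[] encode(byte[] prev, byte[] next) {
        int max = Math.min(prev.length, next.length);

        // Length of the run of identical bytes at the start.
        int prefix = 0;
        while (prefix < max && prev[prefix] == next[prefix]) {
            prefix++;
        }

        // Length of the run of identical bytes at the end, not overlapping the prefix.
        int suffix = 0;
        while (suffix < max - prefix
                && prev[prev.length - 1 - suffix] == next[next.length - 1 - suffix]) {
            suffix++;
        }

        int middle = next.length - prefix - suffix;
        return ByteBuffer.allocate(12 + middle)
                .putInt(prefix)
                .putInt(suffix)
                .putInt(middle)
                .put(next, prefix, middle)
                .array();
    }

    /** Rebuilds next from the previous bytes and a delta produced by encode. */
    static byte[] decode(byte[] prev, byte[] delta) {
        ByteBuffer in = ByteBuffer.wrap(delta);
        int prefix = in.getInt();
        int suffix = in.getInt();
        int middle = in.getInt();

        byte[] next = new byte[prefix + middle + suffix];
        System.arraycopy(prev, 0, next, 0, prefix);
        in.get(next, prefix, middle);
        System.arraycopy(prev, prev.length - suffix, next, prefix + middle, suffix);
        return next;
    }
}
```

Both sides have to keep the last full byte array around, as the serializer described above does. This sketch only pays off when a change is localized; several small changes scattered across the stream would need a real binary diff.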

Doing it at the object level would be a bit tricky. You could write something similar to FieldSerializer that walks two object graphs simultaneously. This would be standalone code though, not a Kryo serializer. At each level you could call equals. Write a byte so that when you read you know if it was equals. If not equals, use Kryo to write the object. Equals would be called many times for the same object, especially for deeply nested objects.
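A rough sketch of walking two object graphs in parallel, using plain reflection rather than Kryo. The names `GraphDiff`, `DiffPerson` and `DiffAddress` are hypothetical. As simplifying assumptions, JDK types, enums and arrays are treated as leaf values compared with `Objects.deepEquals`, and cyclic graphs are not handled. Unlike the scheme above, it descends into fields instead of calling equals on every intermediate object, and it only reports which paths differ; writing the changed values is left out.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class GraphDiff {
    /** Returns the field paths at which the two graphs differ. */
    static List<String> diff(Object oldObj, Object newObj) {
        List<String> changed = new ArrayList<>();
        walk("", oldObj, newObj, changed);
        return changed;
    }

    private static void walk(String path, Object a, Object b, List<String> changed) {
        if (a == b) {
            return; // same reference, or both null
        }
        if (a == null || b == null || a.getClass() != b.getClass() || isLeaf(a.getClass())) {
            if (!Objects.deepEquals(a, b)) {
                changed.add(path.isEmpty() ? "<root>" : path);
            }
            return;
        }
        // Same non-leaf class on both sides: descend into every instance field.
        for (Class<?> c = a.getClass(); c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())) {
                    continue;
                }
                f.setAccessible(true);
                String child = path.isEmpty() ? f.getName() : path + "." + f.getName();
                try {
                    walk(child, f.get(a), f.get(b), changed);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
    }

    // Leaf values are compared as a whole instead of being walked field by field.
    private static boolean isLeaf(Class<?> type) {
        return type.isArray() || type.isEnum() || type.getName().startsWith("java.");
    }
}

/** Hypothetical domain classes, only here to exercise the walker. */
final class DiffAddress {
    String city;
    DiffAddress(String city) { this.city = city; }
}

final class DiffPerson {
    String name;
    DiffAddress address;
    int[] scores;
    DiffPerson(String name, DiffAddress address, int[] scores) {
        this.name = name;
        this.address = address;
        this.scores = scores;
    }
}
```

For two `DiffPerson` instances that differ only in the city, `GraphDiff.diff` reports the single path `address.city`. Arrays are compared as a whole here, which would have to be refined for large arrays where only a few elements change.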

Another way you might do it is to only do the above for scalars and strings, ie only values written by the Output class. The problem is walking two object graphs. To use Kryo I think you'd have to duplicate all the serializers to know about the other object graph.

Possibly you could use Kryo with your own Output that collects values in a list instead of writing them. Use this to "serialize" your old object graph. Now write another version of your own Output that takes this list and uses it to serialize your new object graph. Each time a value is written, first check it with the next object in your list. If equals, write a 1. If not equals, write a 0 and then the value.

This could be made more space efficient by using the first Output twice, once on the old and once on the new graph. Now you have two lists of values. Use these to write a bitstring denoting which are equal. This saves space over writing a whole byte for each value, but has the overhead of an extra list. Finally, write all the values that are not equal.

To finish this idea, you need to be able to deserialize the data. You'll need your own version of the Input class that takes a list of values from the old object graph. Your Input first reads the bitstring (or a byte per value). For a value that was equal, it returns the value from the list instead of reading from the data. If a value was not equal, it calls the super method to read from the data.
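The bitstring variant can be sketched without any Kryo classes, assuming the two value lists have already been collected (by a custom Output, in the idea above). `ValueListDelta` is a hypothetical name, and the in-memory fields stand in for what would actually be written to and read from the stream.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Objects;

/** Delta between two positional value lists: an "equal" bitstring plus the changed values. */
final class ValueListDelta {
    final BitSet equal;         // bit i set => value i is unchanged from the old list
    final List<Object> values;  // only the values that are not equal, in order
    final int size;             // number of values in the new list

    private ValueListDelta(BitSet equal, List<Object> values, int size) {
        this.equal = equal;
        this.values = values;
        this.size = size;
    }

    /** Compares the lists position by position and keeps only what changed. */
    static ValueListDelta encode(List<?> oldValues, List<?> newValues) {
        BitSet equal = new BitSet(newValues.size());
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < newValues.size(); i++) {
            Object value = newValues.get(i);
            if (i < oldValues.size() && Objects.equals(oldValues.get(i), value)) {
                equal.set(i);
            } else {
                values.add(value);
            }
        }
        return new ValueListDelta(equal, values, newValues.size());
    }

    /** Rebuilds the new list: equal positions come from the old list, the rest from the delta. */
    List<Object> decode(List<?> oldValues) {
        List<Object> result = new ArrayList<>(size);
        int next = 0;
        for (int i = 0; i < size; i++) {
            result.add(equal.get(i) ? oldValues.get(i) : values.get(next++));
        }
        return result;
    }
}
```

The comparison is purely positional, so a change that shifts positions (for example a collection growing by one element) makes everything after it look different. A real implementation would need to cope with that.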

I'm not sure if this would be faster than doing it at the byte level. If I had to guess I'd say it probably would be faster. Storing all values in a list will be lots of boxing/unboxing, and this approach still assigns all fields even if they haven't changed. I doubt performance will be a problem either way, so I'd probably just choose the easier approach. Hard to say which that is tho... resurrect the delta stuff or write your own Output/Input classes.

If you feel like contributing back to Kryo, that would of course be great. :)

NateS
  • This might well be my best bet. :) But please write something about it here too, so SO readers don't have to follow the link. – Sebastien Diot May 11 '12 at 08:12
  • Just looked at Kryo again. I had dismissed it before due to lack of facility for long-term storage (schema evolution), but V2 seems to solve that. :) – Sebastien Diot May 11 '12 at 10:59
  • Updated with some more info. TaggedFieldSerializer is the next step up from FieldSerializer schema evolution-wise. After that there is CompatibleFieldSerializer. – NateS May 11 '12 at 20:27
2

Take a look at the Content Repository API for Java (JCR); it is used by Artifactory to manage Maven dependencies. Apache Jackrabbit is the reference implementation of this JSR (JSR-283, version 2).

  • This is very interesting. And it says this: "Another popular type is the versionable type. This makes the repository track a document's history and store copies of each version of the document." :) – Sebastien Diot May 11 '12 at 10:42
1

I do not know of such an API, but it can't be that complicated:

A better solution would be an API that modifies write operations on the model to protocol the changes, so that after the initial "image" is stored, only the protocol needs to be stored.

I would say you need only 2 components: Action and ActionProcessor

You only need to persist a list (protocol) of performed actions.

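import java.util.Date;

// Applies actions and can roll the model back to an earlier point in time.
interface ActionProcessor {
    void perform(Action action);
    void undoToDate(Date date);
}

// One recorded change that knows how to apply and revert itself.
interface Action {
    Date getDate();
    void perform();
    void undo();
}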
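One possible way to fill in these two interfaces, as a sketch only: `StackActionProcessor` and `SetElementAction` are hypothetical names, and "undo to date" is assumed to mean undoing every action dated strictly after the given date, newest first.

```java
import java.util.ArrayDeque;
import java.util.Date;
import java.util.Deque;

interface Action {
    Date getDate();
    void perform();
    void undo();
}

interface ActionProcessor {
    void perform(Action action);
    void undoToDate(Date date);
}

/** Keeps performed actions on a stack so they can be undone newest-first. */
final class StackActionProcessor implements ActionProcessor {
    private final Deque<Action> performed = new ArrayDeque<>();

    @Override
    public void perform(Action action) {
        action.perform();
        performed.push(action);
    }

    @Override
    public void undoToDate(Date date) {
        // Undo every action dated strictly after the given date.
        while (!performed.isEmpty() && performed.peek().getDate().after(date)) {
            performed.pop().undo();
        }
    }

    int performedCount() {
        return performed.size();
    }
}

/** Example action: sets one element of an int array and remembers the old value. */
final class SetElementAction implements Action {
    private final Date date;
    private final int[] target;
    private final int index;
    private final int newValue;
    private int oldValue;

    SetElementAction(Date date, int[] target, int index, int newValue) {
        this.date = date;
        this.target = target;
        this.index = index;
        this.newValue = newValue;
    }

    @Override
    public Date getDate() {
        return date;
    }

    @Override
    public void perform() {
        oldValue = target[index]; // captured at perform time so undo can restore it
        target[index] = newValue;
    }

    @Override
    public void undo() {
        target[index] = oldValue;
    }
}
```

To persist the protocol, each action would also need a serializable form of its parameters and old value; that part is not shown here.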
Roman K
  • This would be the http://prevayler.org/ approach. While it would work, the amount of code required to *manually* implement the undo functions for every action would be tremendous, which would result in many bugs, which would result in failure to undo correctly. That would be the last solution, if all else fails. Secondly, once you bring out a new version of your code, your Action impl would change, causing you to lose reproducibility and "undoability", because you would have recorded the Action ID and its parameters instead of the real change. – Sebastien Diot May 10 '12 at 11:36
1

As far as I know, GemFire is a GemStone (now VMware) enterprise product doing something similar to the GemStone Smalltalk OODB, but for Java. James Foster has created a series of videos on how GemStone works. I found them very interesting. GemStone has a free version to build small (Seaside web) systems with.

Stephan Eggermont