tree / diff-algorithm

Question

I'm currently writing a diff-algorithm to detect inserts, deletes, updates and moves between two revisions of a tree whereas each node has a unique ID which doesn't change through the revisions.

I'm going to traverse each tree in preorder and generate diffs between two nodes on the fly and move the cursors accordingly (for instance after a deleted node is encountered only the cursor on the old revision is moved forward and vice versa for inserted nodes).

Now my problem is that I have to detect the cut and paste points in case of moves (where a moved node is cutted from the old revision and pasted into the new revision) in order to move the right cursor forward and for a subsequent visualization of an agglomerated tree representation.

We have a simple parent/leftsibling/rightsibling/firstchild/currnode encoding whereas each node has a unique ID, a long value. Because this encoding has no knowledge of global ordering I first thought about searching the oldNodeKey in the new revision after the current node in document order and do the vice versa for the cursor on the old revision and save if the node is found and after how many node visits:

/**
 * Search for the supplied node key in following nodes.
 * 
 * @param paramRtx
 *            Treetank {@link IReadTransaction}
 * @param paramNodeKey
 *            node key to search for
 * @return {@code true} if found, {@code false} otherwise
 */
protected Result searchNode(final IReadTransaction paramRtx, final long paramNodeKey) {
    checkNotNull(paramRtx);
    checkArgument(paramNodeKey >= 0);
    final long nodeKey = paramRtx.getNode().getNodeKey();
    boolean found = false;
    int sumNodes = 0;
    for (final AbsAxis axis = new DescendantAxis(paramRtx); !found && axis.hasNext(); axis.next()) {
        sumNodes++;
        if (axis.getTransaction().getNode().getNodeKey() == paramNodeKey) {
            found = true;
        }
    }
    for (final AbsAxis axis = new FollowingAxis(paramRtx); !found && axis.hasNext(); axis.next()) {
        sumNodes++;
        if (axis.getTransaction().getNode().getNodeKey() == paramNodeKey) {
            found = true;
        }
    }
    paramRtx.moveTo(nodeKey);
    return new Result(found, sumNodes);
}

Essentially if newResult.mSum > oldResult.mSum it means the node has been "pasted" and vice versa, and a special case for newResult.mSum == oldResult.mSum but I think it's not correct in case of too many modifications the cut and paste points wouldn't be correctly identified. I've written a lot of code to track different cases, but I think I have to rethink the complete moves-detection stuff :-(

For example I've implemented something like this:

        if (mMovedMap.get(newKey) == null && mMovedMap.get(oldKey) == null) {
            final ExecutorService pool = Executors.newFixedThreadPool(2);
            final Future<Result> foundNew = pool.submit(new Callable<Result>() {
                @Override
                public Result call() throws Exception {
                    return searchNode(paramNewRtx, oldKey);
                }
            });
            final Future<Result> foundOld = pool.submit(new Callable<Result>() {
                @Override
                public Result call() throws Exception {
                    return searchNode(paramOldRtx, newKey);
                }
            });
            pool.shutdown();

            try {
                final Result resultNew = foundNew.get();
                final Result resultOld = foundOld.get();
                paramNewRtx.moveTo(newKey);
                paramOldRtx.moveTo(oldKey);

                if (resultNew.mFound && resultOld.mFound && resultNew.mSumNodes > resultOld.mSumNodes) {
                    moveToNextRightNode(paramOldRtx, null);
                    if (paramOldRtx.getNode().getNodeKey() == newKey) {
                        diff = EDiff.MOVEDCUT;
                        paramOldRtx.moveTo(oldKey);
                        paramNewRtx.moveTo(newKey);
                        fireMovedOldDiffs(paramOldRtx, paramNewRtx, oldKey, diff, paramDepth);
                    } else {
                        diff = EDiff.MOVEDPASTE;
                        paramOldRtx.moveTo(oldKey);
                        paramNewRtx.moveTo(newKey);
                        fireMovedNewDiffs(paramOldRtx, paramNewRtx, newKey, diff, paramDepth);
                    }
                } else if (resultNew.mFound && resultOld.mFound
                    && resultNew.mSumNodes < resultOld.mSumNodes) {
                    moveToNextRightNode(paramNewRtx, null);
                    if (paramNewRtx.getNode().getNodeKey() == oldKey) {
                        diff = EDiff.MOVEDPASTE;
                        paramOldRtx.moveTo(oldKey);
                        paramNewRtx.moveTo(newKey);
                        fireMovedNewDiffs(paramOldRtx, paramNewRtx, newKey, diff, paramDepth);
                    } else {
                        diff = EDiff.MOVEDCUT;
                        paramOldRtx.moveTo(oldKey);
                        paramNewRtx.moveTo(newKey);
                        fireMovedOldDiffs(paramOldRtx, paramNewRtx, oldKey, diff, paramDepth);
                    }
                } else {
                    assert foundOld.get() != null && foundOld.get().mFound;
                    assert foundNew.get() != null && foundNew.get().mFound;
                    assert foundNew.get().mSumNodes == foundOld.get().mSumNodes;
                    ...
                }

whereas mMovedMap is a simple Map to keep track of moved nodes after they have been encountered.

Edit: I try to detect inserts/deletes/updates and moves in a tree, whereas the nodes have unique IDs. The hard part seems to be detecting the moves. I'm doing two preorder traversals (one over the old revision and one over the new revision). It's pretty easy to determine inserts/deletes and updates but I have trouble to detect moves and because I'm always comparing two nodes (one in the old revision with one in the new) I have to know which one of the two has actually moved (if it was the node in the old revision it's the cut point, if the node in the new revision has been moved it's the paste point). I also have to know if it's the node in the old revision or the one in the new revision which has been moved and how because I'm creating an agglomerated tree representation with all edit operations incorporated to visualize the diffs in a specialized Sunburst view.

Edit: I think it's impossible to decide which one is the cutted node (or subtree) and which one the pasted node (or subtree) even if I would have global identifiers. It's not sufficient to know which of the two nodes comes first because of other modifications :(

Edit: Does anyone know if the problem of finding out which node has been moved (comparing two nodes) in a tree is NP-complete? Or more generally detecting if one of both nodes has been moved considering a cursor on a node in the old revision and another cursor is located at a node in the new revision and if the moved node has been cut out of the old tree or if the moved node has been inserted into the new position? The diff algorithm is designed in such a way that I can agglomerate or fuse two trees together such that they share common nodes, which is fine for inserts/deletes/same nodes/updates and most probably also for replaced nodes, but I think it can't be done for moves? I would need a reference if it's NP-complete or unsolvable, because it's a part of my master thesis and at least I want to describe why I haven't implemented the move detection (or reverted a non functional implementation ;-)).

Edit: Maybe a solution is:

// Check if it has been INSERTED, DELETED or MOVED.
// ================================================================
final long nodeKeyOld = paramOldRtx.getNode().getNodeKey();
final long nodeKeyNew = paramNewRtx.getNode().getNodeKey();
final boolean movedOld = paramOldRtx.moveTo(nodeKeyNew);
final boolean movedNew = paramNewRtx.moveTo(nodeKeyOld);
if (!movedNew && mDiff == EDiff.DELETED) {
    paramOldRtx.moveTo(nodeKeyOld);
    if (paramOldRtx.getNode().getNodeKey() == mDeletedKey) {
        movedNew = true;
    }
}

if (movedOld && movedNew) {
    diff = EDiff.MOVED;
} else if (movedOld) {
    paramOldRtx.moveTo(nodeKeyOld);
    mDeletedKey = paramOldRtx.getNode().getNodeKey();
    diff = EDiff.DELETED;
} else {
    diff = EDiff.INSERTED;
}

to detect the MOVE operation itself like I'm doing it now (the special case to check for !movedNew && mDiff == EDiff.DELETED is needed for the end of a tree where only DELETES have been done, but nodes might also be moved). In all other cases it should be sufficient to test if the cursor (transaction) on the new revision can be moved to the node in the old revision and the cursor on the old revision can be moved to the node in the new revision, right?

Then I have to keep track of all upcoming changes (or also nodes which are identical) and if another move is detected I have to check if one of the two node keys (from the node in the old revision and the node in the new revision) has been encountered before. If it's the old node it must have been a cut and the current encountered move is the paste, otherwise vice versa). If it's not one of the keys it must be another move operation.

What do you think? I'm a bit reluctant to implement it if I'm not at least 99% sure if it works. I have spend about 6 days on a solution which didn't work.

Edit: Ok, I think it's a bad idea, because I don't know how to move the cursors forward if I don't know right at the time which one is the node which has been moved.

kind regards,
Johannes

It looks like you are trying to do a greedy evaluation of the differences between two tree structures. That might be very hard to do, as some operations are just misleading. If someone moves a node, and adds a copy of said node, you don't know which one is the original. Sourcecontrol tools cannot do that either, because the problem is either NP-complete or unsolvable. Those tools either ignore moves and just assume it's Delete+Insert, or require the user to specify those moves manually. — Kajetan Abt, Oct 18 '11 at 12:57
Yes, I'm importing XML data into our database system based on an ID-less algorithm, to store the diffs. Since this algorithm uses moves I thought it would be best if the internal diff-algorithm which uses IDs can track these generated moves. Subsequently I'm visualizing the changes in an agglomerated tree representation which itself introduced a lot of complexity compared to side by side views. Even though my visualization should work with moves it seems it's impossible to decide which one is the node which has been moved and if the node has been cut of the tree or has been pasted. — Johannes, Oct 18 '11 at 13:20

tree / diff-algorithm

0 Answers0

Linked