Algorithm to detect when graph re-converges (similar to common subtree?)

Question

I've been working on this problem all day, I'm re-writing one of our legacy products and I'm having a hard time determining how to find a specific node in my flow chart. The problem reminds me of University, but for the life of me I can't come up with an algorithm to solve this.

I've attached 3 screen shots to help explain this, but the basic problem is, given a YES/NO? decision node, locate the closest child node that terminates the branch.

I'm working in C# .NET and JSON. In JSON I've got an object that gives each node a unique identifier, and also identifies each "link" from one node to the next. I would hope to write a function (or several) to determine the first "end node" given a branched node in C#. Currently I've built out the jSON into XML in C#.

Any and all ideas encouraged, not really looking for code but an approach/algorithm.

enter image description here

given the yes/no find the delay block.. first node that all child nodes traverse to

Attached is the output in jSON from the diagram:

{ "class": "go.GraphLinksModel",
  "linkFromPortIdProperty": "fromPort",
  "linkToPortIdProperty": "toPort",
  "nodeDataArray": [ 
{"key":-1, "category":"Start", "loc":"169 288", "text":"Start"},
{"key":-2, "category":"End", "loc":"855 394", "text":"End"},
{"category":"Branch", "text":"Yes or No", "key":-4, "loc":"284.8837209302326 285.7848837209302"},
{"category":"DelayNode", "text":"Delay", "key":-3, "loc":"365.8837209302326 215.52345997177622"},
{"category":"Branch", "text":"Yes or No", "key":-5, "loc":"478.8837209302326 214.52345997177622"},
{"category":"DelayNode", "text":"Delay", "key":-6, "loc":"568.8837209302326 151.52345997177622"},
{"category":"DelayNode", "text":"Delay", "key":-7, "loc":"573.8837209302326 268.5234599717762"},
{"category":"DelayNode", "text":"Delay", "key":-8, "loc":"653.8837209302326 215.52345997177622"},
{"category":"Branch", "text":"Yes or No", "key":-9, "loc":"392.8837209302326 392.5234599717762"},
{"category":"DelayNode", "text":"Delay", "key":-10, "loc":"454.8837209302326 317.5234599717762"},
{"category":"DelayNode", "text":"Delay", "key":-11, "loc":"550.8837209302326 473.5234599717762"},
{"category":"DelayNode", "text":"Delay", "key":-12, "loc":"549.8837209302326 317.5234599717762"},
{"category":"DelayNode", "text":"Delay", "key":-13, "loc":"711.8837209302326 343.5234599717762"},
{"category":"Branch", "text":"Yes or No", "key":-14, "loc":"434.8837209302326 487.5234599717762"}
 ],
  "linkDataArray": [ 
{"from":-4, "to":-3, "fromPort":"T", "toPort":"L", "visible":true},
{"from":-1, "to":-4, "fromPort":"R", "toPort":"L"},
{"from":-3, "to":-5, "fromPort":"R", "toPort":"L"},
{"from":-5, "to":-6, "fromPort":"T", "toPort":"L", "visible":true},
{"from":-5, "to":-7, "fromPort":"B", "toPort":"L", "visible":true, "text":"NO"},
{"from":-6, "to":-8, "fromPort":"R", "toPort":"L"},
{"from":-7, "to":-8, "fromPort":"R", "toPort":"L"},
{"from":-4, "to":-9, "fromPort":"B", "toPort":"L", "visible":true, "text":"NO"},
{"from":-9, "to":-10, "fromPort":"T", "toPort":"L", "visible":true},
{"from":-10, "to":-12, "fromPort":"R", "toPort":"L"},
{"from":-11, "to":-13, "fromPort":"R", "toPort":"L"},
{"from":-12, "to":-13, "fromPort":"R", "toPort":"L"},
{"from":-8, "to":-13, "fromPort":"R", "toPort":"L"},
{"from":-13, "to":-2, "fromPort":"R", "toPort":"L"},
{"from":-9, "to":-14, "fromPort":"B", "toPort":"L", "visible":true, "text":"NO"},
{"from":-14, "to":-11, "fromPort":"T", "toPort":"L", "visible":true},
{"from":-14, "to":-11, "fromPort":"B", "toPort":"L", "visible":true, "text":"NO"}
 ]}

In all of your examples, once you take one branch or the other, you always eventually end up at "end". You never loop back around and possibly enter the decision node again. Does that property hold in general, or do you need to consider situations where there are loops? The solution is straightforward if there are no such loops; if loops can exist then it's slightly trickier. — Eric Lippert, Mar 08 '13 at 21:22
@EricLippert you are correct in assuming that there are no loops back, everything moves forward. I think my comment on the answer below is actually the solution I'm looking for. — Brian Garson, Mar 08 '13 at 21:27
Excellent. Note also that you need to have the property that there is exactly one "end", or at least that if there is more than one "end" node, they compare as equal. — Eric Lippert, Mar 08 '13 at 21:30
I've given it some more thought and I'm not sure the given answer is correct. See my comment. — Eric Lippert, Mar 08 '13 at 22:31

Eric Lippert · Answer 1 · 2013-03-09T15:52:24.197

UPDATE: I've come up with another solution, this time using a standard solution to the Lowest Common Ancestor problem. See my other answer.

As noted in the comments to Oren's answer, Oren has actually answered the question "what is the closest node that can be reached from both branches?" But the question that is actually to be answered is "what is the closest node that must be reached by both branches?"

That's a considerably harder problem to solve and I do not know an efficient solution off the top of my head. However, here's a sketch of an algorithm that would work.

Suppose the given decision node is called A, and WOLOG it has two children B and C. The question then is what is the node, call it G, that has these two properties:

Every possible path from B to END, and every possible path from C to END, passes through G.
G is the closest node to B and C that has the first property.

(Note that G might be END. We might have A-->B, A-->C, B-->END, C-->END.)

We can start by making a set of candidates for possible G's easily enough. Pick any path from B to END -- choose it at random if you like -- and put its nodes in a hash set. Then pick any path from C to END, and put its nodes in a hash set. The intersection of those two sets contains G. Call the intersection set Alpha.

So now let's remove from Alpha all the nodes that are definitely not G. For each node N in the set Alpha:

Construct a new graph identical to the original graph, but missing N.
Is END still reachable from either B or C in the new graph? If yes, then N is definitely not G. If no, add N to set Beta.

If when we're done that Beta is empty then G is END.

Otherwise, there are nodes in Beta. Each one of those nodes has the property that if it is removed, there is no other way to get from B or C to END. Exactly one of those nodes has to be closest to B -- if two were equally close then one of them would not be necessary! -- so do a breadth-first traversal from B, and the first time you encounter a node in Beta, that's your G.

This sketch does not seem like it would be fast if the graph is large, but I'm pretty sure it will work. I would be interested to know if there is a better solution.

I think the idea is to start searching the graph, starting at a decision node. If you reach another decision node, you record your current path. And start searching again (for closure). If a dead end or max-depth is reached, return no solution. Else, if you find closure, you coalesce that subgraph into a "delay" node, and use your recorded path to back up to the previous decision node and repeat. — Alan, Mar 09 '13 at 11:28

Eric Lippert · Answer 2 · 2013-03-09T15:55:46.577

I like Alex's answer and it seems highly efficient. Here's another way to solve your problem, which as it turns out is actually the Lowest Common Ancestor problem. This is an extremely well studied problem in graph theory; I just didn't see it at first because you have to reverse all the arrows.

With this solution you do an expensive preprocessing once, and then you have a data structure that you can just read the answer off of.

enter image description here

I've labelled the nodes in your graph. What you do is first you reverse all the arrows and then you construct the Eulerian traversal of the resulting DAG, at each step remembering how far from the "root" (end) you are.

The Eulerian traversal goes "visit yourself, recurse on a neighbour, visit yourself, recurse on a neighbour, ... visit yourself". That is, if we were writing it in C# we'd say something like:

void Eulerian(Node n, int i, List<Tuple<Node, int>> traversal)
{
    traversal.Add(Tuple.Create(node, i));
    foreach(Node neighbour in node.Neighbours)
    {
        Eulerian(neighbour, i + 1, traversal);
        traversal.Add(Tuple.Create(node, i));
    }
}

The Eulerian traversal is the traversal you would actually take if you were walking the graph.

The graph you've given as an example has the following Eulerian traversal when you reverse all the arrows and start from END.

A0 B1 C2 D3 E4 F5 G6 H7 G6 F5 E4 D3 C2 I3 E4 F5 G6 H7 G6 F5 E4 I3 C2 
B1 J2 K3 L4 G5 H6 G5 L4 K3 J2 B1 M2 N3 L4 G5 H6 G5 L4 N3 M2 N3 L4 G5 
H6 G5 L4 N3 M2 B1 A0

The answer to your question can now be read off the traversal. In your examples, you have decision nodes E, L and G.

For E: find the first and last occurrences of E in the traversal. Now search the nodes between those two in the traversal for the one with the smallest number. It's C, with a score of 2.

For L: find the first and last occurrences of L in the traversal. The node with the smallest number between them is B, with a score of 1.

For G: again the node with the lowest score between the first and last occurrences of G is B.

Computing the traversal might be expensive if the graph is large, but the nice thing is that you only have to do it once. After that, it's just a linear search problem. (And you could make it a sublinear search problem if you really wanted to, but that seems like a lot of extra work.)

I really like this answer, I've gotten my own variation of a different algorithm worked out in my head, but I think I may implement this one instead. — Brian Garson, Mar 11 '13 at 14:35

score 2 · Answer 3 · answered Mar 08 '13 at 21:20

2

If I understand the problem, you're trying to find the first common node between the "yes" and "no" subtrees. In that case, a simple way to do that would be an breadth-first traversal of the yes subtree, adding each element to a list. Then do a breadth-first traversal or the no subtree, stopping when an element in that subtree appears in the "yes" list.

answered Mar 08 '13 at 21:20

Oren Melzer

749
4
7

I think this makes sense. I suppose (the graph here doesn't show an example) but when I encounter a nested branch (yes/no) node I should recurse back into the same algorithm to solve this branches common node, before continuing on in my top most branch. – Brian Garson Mar 08 '13 at 21:25
1

This is the solution I was going to recommend. All I would add is that you should use a hash set rather than a list if the number of nodes is going to be large. – Eric Lippert Mar 08 '13 at 21:28
By nested, do you mean like the bottom picture? You shouldn't need to recurse this algorithm for that. When you reach a yes/no, this is just like a two-child node in a regular binary tree. You traverse "this", then the yes child, then the no child. – Oren Melzer Mar 08 '13 at 21:29
1

@EricLippert: good point. Using a hashset improves the efficiency of this algorithm. It also means that your first traversal can be in any order you want, unlike the second one which must still be breadth-first. – Oren Melzer Mar 08 '13 at 21:31
thank you guys.. I can't believe I spent all day and overlooked this. Cheers – Brian Garson Mar 08 '13 at 21:33
Do you think a depth first search of the no subtree would get a result faster? I believe that any path that gets chosen as the "first" path in depth first order would have to include the node you are looking for. – mbeckish Mar 08 '13 at 21:39
2

Wait, I'm not so sure this answer is right; I think the problem is underspecified. Suppose we have the graph START->A, A->B, A->C, B->E, B->G, C->D, C->E, D->F, E->G, F->G, G->END. We wish to find the nearest common descendent of B and C. Is the correct answer E or G? If it is E then this algorithm is right. But it is plausible that the OP wants G. – Eric Lippert Mar 08 '13 at 22:27
"E" in my example is the *earliest node you can get to via either B or C*. "G" is the first node that you *must* get to no matter whether you go through B or C. Which one do you want? Or are neither of these a correct characterization of the problem? – Eric Lippert Mar 08 '13 at 22:34
1

Another way of looking at it is: Suppose you want to draw a box around "A" and a bunch of other nodes, such that there is only one line coming into the box and only one line coming out of the box. If the box contains as few nodes as possible, what node does the line leaving the box connect to? The answer to that question is "G", and I think that's what the OP means by "terminates the branch". Yes? – Eric Lippert Mar 08 '13 at 22:54
My interpretation of the problem was the first node which can be reached by either Yes or No, which is E in your example. G is the answer to a different question: what is the first node that is reachable by either Yes or No and by all Yesses and Nos in all child branches? To me the first question makes more sense, but I don't know the OP's intent for sure. – Oren Melzer Mar 08 '13 at 23:44
@EricLippert it would be G in the one you specified. The answer provided still makes sense, but recursion is necessary: travel down the yes nodes, whenever you reach a decision branch, enter the algorithm again for that decision branch subset, and use the answer from entering that as your next node, and continue. Keep calling the algorithm over as you reach more decision nodes until you hit the last node. Then start on the no branch, and apply the same approach while checking your list of possible end nodes from your yes branch. – Brian Garson Mar 09 '13 at 00:53
@BrianGarson: I haven't worked it out but I am not convinced that a recursion on the given algorithm is (1) well-founded -- does it actually solve the problem without unbounded recursion, etc, and (2) efficient if the graph is complex. – Eric Lippert Mar 09 '13 at 01:03

score 1 · Answer 4 · answered Mar 09 '13 at 11:48

This can be done in O(V + E) time using two breadth/depth-first searches.

First, let's characterize the problem. We are given a graph, a starting node A and an end node END. As defined by @EricLippert (and slightly adapted by me, as we can equivalently use A instead of B and C), we are looking for the node G with these two properties:

Every possible path from A to END passes through G.
G is the closest node to A that has the first property.

First, we find a path from A to END using an X-first search. Let P = (p1, p2, p3, ..., pk) with p1 = A and pk = END be this path, and number them 1, 2, ..., k (so pi has number i). Store these numbers with the nodes in the graph. Remove the edges used by P from the graph. Do another X-first search from A and find the highest numbered node pi that is reachable from A. We slightly adapt this search: every time we find a reachable node pj that is higher than the previous highest reachable node pk, we add all the edges in the sub-path (pk, pk+1, ..., pj) back in the graph, and mark all these nodes in between as reachable and also add all the nodes that were not reachable before to the queue/stack used in our X-first search for processing. Return the pi found in this manner.

We will now show that pi satisfies the conditions of G. Suppose (for the purpose of reaching a contradiction) we have a path S=(s1, s2, s3, ..., sj) with s1 = A and sj = END that does not go through pi. Some prefix (s1, s2, s3, ... sm) of this path agrees with our path P, so s1=p1, s2=p2, ..., sm=pm, but sm+1 != pm+1. As S did not go through pi and pi was reached by our second X-first search, sm is made reachable in our second X-first search when we add the edges used by (s1, s2, s3, ..., sm). Therefore, our second X-first search will also traverse the rest of the path S, until it either reaches END or some edge used in S (pr, pr+1) lies on the path P but r > i and hence this edge is removed. However, then we would have found that pr (or END) is reachable and r > i so we would not have returned pi, which is a contradiction. Hence, all paths go through pi.

Now suppose there is some node G' that has the properties as above, but is 'closer' to A than pi. As all paths go through G', P also goes through G', so there is some v with pv=G' and v < i. As all paths go through G' and (pv, pv+1) was removed, no nodes beyond pv are reachable initially, and hence (pv, pv+1) is never added, so pi never becomes reachable either, so it cannot have been returned by our algorithm, which is a contradiction. Thus pi is the closest such node, satisfying the conditions of G and the algorithm is correct.

The first X-first search takes O(V+E) time. The second also takes O(V+E) time: all nodes are added to the queue/stack at most once, so the running time is preserved even though we add edges back to the graph; the final edge count does not exceed the edge count of the original graph either. Maintaining the indices of the path and the highest index found so far takes O(V+E) time as well, so we conclude our algorithm works in O(V+E) time.

score 0 · Answer 5 · answered Mar 11 '13 at 14:30

So I've put more thought into this taking into account some of the answers provided here, I haven't quite started writing the code for it but given one of the answers I've modified it slightly into pseudo code.

I think a more efficient approach would be to keep track of all the results for any branch node in a common dictionary variable (key,value) pair. Any thoughts to this approach? One thing I should mention is that the graph should never grow to more than 25 nodes.

//call this function for all nodes with 2 children
int getFirstCommonAllpathEndNode(currentNodeId, xml, endNodeId){

    Array[] possibleNodeIds;
    //traverse child nodes
    foreach(childnode of CurrentNode.TopPortChildNode){
        if(childNode has 2 ports)
        {
            possibleNodeIds.add( getFirstCommonAllPathEndNode(childNode.id, xml, endnodeId));
        }
        else{
            possibleNodeIds.add(childNode.Id);
        }
    }

    foreach(childnode of CurrentNode.BottomPortChildNode){
        if(childNode has 2 ports)
        {
            //recursive call for this branch to get it's first common all path end node
            var someNodeId = getFirstCommonAllPathEndNode(childNode.id, xml, endnodeId) 
            if(possibleNodeIds contains someNodeId) 
                return someNodeId;
            //otherwise skip forward to someNodeId as next node to process in for loop
        }
        else{
            if (possibleNodeIds contains childNode.Id)
                return childNode.id
        }
    }
    //return the endNode if nothing else is satisfied. although this code should never be hit
    return endNodeId;
}

If the graph never has more than 25 nodes then pretty much any solution you choose is going to perform reasonably well. That said, if a graph is immutable then memoization is certainly a reasonable strategy. You might consider writing a general-purpose function memoizer rather than adding memoization logic to one of your functions. I'll give you one in the next comment. — Eric Lippert, Mar 11 '13 at 16:13
`static Func Memoize (this Func func) { var dictionary = new Dictionary (); return (A a)=>{ R r; if (!dictionary.TryGetValue(a, out r)) { r = func(a); dictionary.Add(a, r); } return r; }; }` And now you can take any function and pass it to `Memoize` and hey presto, you have a memoized version of that function. Very handy, and you don't have to mix your memoization logic in with your business logic. — Eric Lippert, Mar 11 '13 at 16:16

Algorithm to detect when graph re-converges (similar to common subtree?)

5 Answers5