The only PHP library I found for manipulating graphs is the PEAR “Structures_Graph” package ( http://pear.php.net/manual/en/package.structures.structures-graph.php ). It is not currently being maintained, important features (e.g. removing a node) are not implemented, and serious bugs are open (e.g. inability to install under Windows 7). It does not look like that package, in its current form, would be useful.
The basic operations need to manipulate your graph can be split up into those that change the graph (mutating) and those that query it (non-mutating).
Mutating operations
- CreateNode($nodeName), returns a $nodeID – Note that when nodes are created, they have no edges connecting them to other nodes in the graph.
- DeleteNode($nodeID) – To enforce referential integrity, this should only be allowed if all edges connecting to the node have previously been deleted, and an exception thrown otherwise.
- UpdateNodeName($nodeID, $newNodeName) – If it is allowed to change the name of an existing node.
- CreateHorizontalEdge($souceNodeID, $destinationNodeID) – This is a directed edge. Throws exceptions if edge already exists, or if adding edge wound create a cycle.
- DeleteHorizontalEdge($sourceNodeID, $destinationNodeID)
- CreateVerticalEdge($firstNodeID, $secondNodeID) – This is a bidirectional edge, the first and second node IDs can be switched and the effect on the graph will be the same. Throws an exception if edge already exists or if the two nodes do not have the same horizontal parent.
- DeleteVerticalEdge($firstNodeID, $secondNodeID) – As edge is non-directed, it will delete edge even if the arguments were in the opposite order for creation.
Examples:
To build the section of your graph that has node names starting with “B” manually, the code would be:
$nodeID_B = CreateNode(“B”);
$nodeID_B1 = CreateNode(“B1”);
$nodeID_B2 = CreateNode(“B2”);
$nodeID_B3 = CreateNode(“B3”);
CreateHorizontalEdge($nodeID_B, $nodeID_B1);
CreateHorizontalEdge($nodeID_B, $nodeID_B2);
CreateHorizontalEdge($nodeID_B, $nodeID_B3);
CreateVerticalEdge($nodeID_B1, $nodeID_B2);
CreateVerticalEdge($nodeID_B2, $nodeID_B3);
Code to manually remove the node with name “B3”:
// Must remove all edges that connect to node first
DeleteVerticalEdge($nodeID_B2, $nodeID_B3);
DeleteHorizontalEdge($nodeID_B, $nodeID_B3);
// Now no edges connect to the node, so it can be safely deleted
DeleteNode($nodeID_B3);
Non-mutating operations
- NodeExists($nodeID) – Returns true/false
- GetNodeIDByName($nodeName) – Returns $nodeID
- GetNodeName($nodeID)
- HorizontalEdgeExists($sourceNodeID, $destinationNodeID) – Returns true/false
- VerticalEdgeExists($firstNodeID, $secondNodeID) – Returns true/false, same result regardless of argument order.
- HorizontalConnectionExists($startNodeID, $endNodeID) – Returns true/false – Following the horizontal arrows, is there a path from the start node to the end node? To test if creating a new horizontal edge from $nodeID1 to $nodeID2 will create a cycle, call HorizontalConnectionExists($nodeID2, $nodeID1).
- GetHorizontalAncestorNodes($nodeID) – Returns an array of all nodes that have horizontal paths leading from them to the argument node.
- GetHorizontalDescendentNodes($nodeID) – Returns an array of all nodes that have horizontal paths leading from the argument node to the argument node.
- GetVerticalSiblingNodes($nodeID) – Returns an array of all nodes that have vertical connections to the argument node. Since (as per your answer to my clarifying question), vertical relationships are not transitive, the function VerticalEdgeExists (above) is the only one needed to query for vertical relationships.
Example:
To get a list of nodes to be included in the subtree you describe in your question, combine the results of GetHorizontalAncestorNodes($nodeID) and GetVerticalSiblingNodes($nodeID).
Data Structures
You will always need a “Nodes” table to hold nodeID and nodeName. This table can be extended to hold other information associated with the nodes.
Since vertical edges are not transitive, information about them can simply be put in a “VerticalEdges” table with columns vEdgeID, firstNodeID, secondNodeID.
There are several options for how to store the horizontal edge information. On the one hand, the data structures and mutating operations can be simple, but at the cost of making some of query operations slower and more complex. On the other hand, the data structures can be slightly more complex but possibly much larger (growing exponentially with number of nodes in the worse case), with more complex mutating operations, but simpler and faster queries. Deciding which implementation is best for you will depend on the size of your graphs and how often they change compared to the number of times they are queried.
I will describe three possible data structures to represent the graph, moving from simple to complex. I will go into detail on the algorithms for the operations listed above only for the last set of data structures. I think that set of structures is best for smaller graphs that have a high ratio of queries to changes.
Note that all data structures have the “Nodes” and “VerticalEdges” tables I discussed above.
Minimal Data Structure
The first data structure has a “HorizontalEdges” table with columns hEdgeID, sourceNodeID and destinationNodeID. The mutating functions are simple, and most of the code will be error checking code that throws exceptions. The non-mutating functions HorizontalConnectionExists,
GetHorizontalAncestorNodes and GetHorizontalDescendentNodes will be complex and potentially slow. Each time they are called, they will recursively traverse the HorizontalEdges table and collect nodeIDs. Those collections are returned directly for the later two functions, while HorizontalConnectionExists can terminate and return true if it finds the end node while searching descendants of the start node. It will return false if the search finishes without finding the end node.
Transitive Closure table
The second data structure also has a HorizontalEdges table identical to the one described above, but also has a second table “HorizontalTransitiveClosures” with columns hTCID, startNodeID and endNodeID. There is a row in this table for each pair of start node and end node such that a path using horizontal edges can be traced from the start node to the end node.
Example:
For the graph in the question, the rows in this table that include node A (to simplify notation, I will use the names, rather than integer node IDs to identify nodes, and omit the hTCID column) are:
A, A2
A, A2B1
A, A2B1B2
A, X
A, Y
A, Z
Rows that include node A2B1 (the first one is also in the set above) are:
A, A2B1
A2, A2B1
B, A2B1
B1, A2B1
A2B1, A2B1B2
A2B1, X
A2B1, Y
A2B1, Z
In a worse case scenario, this table scales as the square of the number of nodes.
With this data structure, HorizontalConnectionExists,
GetHorizontalAncestorNodes and GetHorizontalDescendentNodes can be implemented as simple searches of the HorizontalTransitiveClosures table. The complexity is moved to the CreateHorizontalEdge and DeleteHorizontalEdge functions. DeleteHorizontalEdge is particularly complex and requires a fair bit of explanation as to why the algorithm works.
Transitive Closure with Paths
The final data structure I will discuss stores the horizontal edge information in two tables. The first, “HorizontalTransitiveClosurePaths” has columns hTCPathID, startNodeID, endNodeID, pathLength. The second table “PathLinks” has columns PathLinkID, hTCPathID, sourceNodeID, destinationNodeID.
The HorizontalTransitiveClosurePaths table is similar to the HorizontalTransitiveClosures table in the data structure described above, but it has one row for each possible path that can accomplish a transitive closure, rather than one row per transitive closure. For example, in the graph shown in the question, the HorizontalTransitiveClosures table would have one row (B, A2B1B2) (shorthand notation as above) for the closure that started at B and ended A2B1B2. The HorizontalTransitiveClosurePaths table would have two rows: (10, B, A2B1B2, 3) and (11, B, A2B1B2, 2), since there are two ways to get from B to A2B1B2. The PathLinks table describes each of those paths with one row per edge making up the path. For the two paths from B to A2B1B2, the rows are:
101, 10, B, B1
102, 10, B1, A2B1
103, 10, A2B1, A2B1B2
104, 11, B, B2
105, 11, B2, A2B1B2
There is no need for a HorizonalEdges table, since the edges can be found by selecting the rows in the HorizontalTransitiveClosurePaths table with a length of 1.
The query functions are implemented the same way as in the Transitive Closure data structure described above. Since multiple paths may exist for a closure, the GROUP BY operator will need to be used. For example, the SQL query that returns all nodes that are ancestors of the node with ID :nodeid is:
SELECT startNodeID from HorizontalTransitiveClosurePaths WHERE endNodeID = :nodeid GROUP BY startNodeID
To implement DeleteHorizontalEdge, search the PathLinks for the hTCPathID of all paths that contain the edge. Then delete those paths from the HorizontalTransitiveClosurePaths table and all edges associated with the paths from the PathLinks table.
To implement CreateHorizontalEdge($souceNodeID, $destinationNodeID), first search HorizontalTransitiveClosurePaths table for paths that end at $souceNodeID – this is the “ancestor path set”. Search the HorizontalTransitiveClosurePaths for paths that start at destinationNodeID – this is the “descendent path set”. Now insert new paths from the following 4 groups (some of which may be empty) into the HorizontalTransitiveClosurePaths table and insert the links for those paths in the PathLinks table.
- The length 1 path from $souceNodeID to $destinationNodeID
- For each path in the ancestor path set, a new path that extends the end of the path by one additional link from $souceNodeID to $destinationNodeID
- For each path in the descendant path set, a new path that extends the start of the path by one additional link from $souceNodeID to $destinationNodeID
- For each combination of one path from the ancestor path set and one path from the descendant path set, a path created by slicing the ancestor path, to the link from $souceNodeID to $destinationNodeID, to the descendant path.
Example:
A graph consists of 6 nodes: A1, A2, B, C, D1 and D2. It has 4 edges, (A1, B), (A2, B), (C, D1), (C, D2). The HorizontalTransitiveClosurePaths table (using node name rather than a number) is:
1, A1, B, 1
2, A2, B, 1
3, C, D1, 1
4, C, D2, 1
The PathLinks table is:
1, 1, A1, B
2, 2, A2, B
3, 3, C, D1
4, 4, C, D2
Now we add the edge from B to C. The ancestor path set is (1, 2) and the descendant path set is (3, 4)
The paths added in each of the 4 groups are:
- The new edge itself: HorizontalTransitiveClosurePaths: (5, B, C, 1); PathLinks (5, 5, B, C)
- Extend each ancestor path by one link at the end:
HorizontalTransitiveClosurePaths:
6, A1, C, 2
7, A2, C, 2
PathLinks:
6, 6, A1, B
7, 6, B, C
8, 7, A2, B
9, 7, B, C
- Extend each descendent path by one link at the start:
HorizontalTransitiveClosurePaths:
8, B, D1, 2
9, B, D2, 2
PathLinks:
10, 8, B, C
11, 8, C, D1
12, 9, B, C
13, 9, C, D2
- Splice together all combinations containing one ancestor path and one descendant path:
HorizontalTransitiveClosurePaths:
10, A1, D1, 3
11, A1, D2, 3
12, A2, D1, 3
13, A2, D2, 3
PathLinks:
14, 10, A1, B
15, 10, B, C
16, 10, C, D1
17, 11, A1, B
18, 11, B, C
19, 11, C, D2
20, 12, A2, B
21, 12, B, C
22, 12, C, D1
23, 13, A2, B
24, 13, B, C
25, 13, C, D2
Let me know if any part of the answer needs additional clarification.