How do I enumerate all paths in a Directed Acyclic Graph structure using Python, when the data structure can only be discovered via traversal?

Question

Please allow me to clarify a few details before beginning:

I've really worked hard to research this problem on my own. I have looked at solutions like this or this but all those solutions assume you have the entire data structure in memory.
This is for something practical, and is not a homework assignment or job interview question. I've attempted to come up with a solution and have been unable, even though I know it's theoretically possible.
The specifics: I am using python/Selenium to traverse a Twine file, but I'm showing a generalized pseudocode interface because those details aren't important to solving the problem, only to demonstrate that I need to traverse the path to discover it.

So, I have a Directed Acyclic Graph with about 150 nodes. There is a single start node, and many end nodes.

Imagine I have an interface similar to this to traverse the graph:

class TheGraphToTraverse:

    def get_the_first_node(self):
        pass

    def get_the_list_of_edges_from_the_current_node(self):
        """
        Returns list of links/edges that can be followed
        """

    def follow_this_edge(self, edge):
        """
        Follow an edge link to its destination node
        """

    def go_back_one_step(self):
        """
        Navigate back one step
        """

    def go_back_to_the_start_node(self):
        """
        Start over from the beginning node
        """

    def is_this_node_an_end_node(self):
        """
        Is the current node an end node with no further edges to traverse?
        """

How would I adapt a (recursive or non-recursive) algorhithm like these but without needing to pass the graph itself into the recursive function:

data = {1 : [2,3],   # Directed acyclic graph adjacency list
        2 : [3],
        3 : [4,5],
        4 : [5]}

def dfs(data, path, paths = []):   
    datum = path[-1]              
    if datum in data:
        for val in data[datum]:
            new_path = path + [val]
            paths = dfs(data, new_path, paths)
    else:
        paths += [path]
    return paths

See how this algorithm (and all I can find online) has you passing the entire data structure in? I don't have the graph as a data structure and need to discover it during traversal, so I need to be able to keep some kind of state on my current pathway, and I'm having trouble seeing clearly how to store that state reliably.

I've gotten this far:

    list_of_all_paths = []
    g = TheGraphToTraverse()
    first_node = g.get_the_first_node()
    list_of_all_paths = dfs(g, first_node, list_of_all_paths)

    def dfs(g: TheGraphToTraverse, path, all_paths):

        # check for end state
        if g.is_this_node_an_end_node:
            all_paths += [path]
            g.go_back_one_step()
            # I recognize I'm not sure how to store 
            # the previous edges I've stored for this path
            all_paths = dfs(g, path, all_paths)
        else:
            edges = g.get_the_list_of_edges_from_the_current_node()
            # how do I know which of these I've already tried?
            # how do I know if this is a list of 
            # edges I've already tried?
            for edge in edges:
                g.follow_this_edge(edge)
                all_paths = dfs(g, path, all_paths)
        return all_paths

Any help or adapted algorithm, even if it's not in Python, or is not recursive is fine. Thanks so much for engaging with this.

Is there a specific reason why you do not construct a graph first? — Jasmijn, Aug 08 '21 at 20:46
I don't understand. Why does your Graph class have a concept of a "current" node? — Karl Knechtel, Aug 08 '21 at 21:05
Why do you have to ask *the Graph* what the edges are of that node, rather than asking any arbitrary node what its outbound edges are? Even if that information is computed on the fly, it should be possible from the node instance - why not? — Karl Knechtel, Aug 08 '21 at 21:18
Hey @KarlKnechtel, as I said above, the thing I'm testing is a weird HTML/JavaScript structure I'm navigating via selenium by actually clicking on links, and then executing an XPath search statement to see what's links/edges are on the current node. Until I've actually traversed to a node, I can't see what's there. — danieltalsky, Aug 08 '21 at 21:20
Okay, but *why does that cause a problem*? Having traversed to a node, you can figure out the edges *of that node*, by *examining* the node, right? And you *don't* need to rely on some other Graph interface having "remembered" a "current" Node, because you're *already there and have that instance*. So really all you need to do is replace the `data[datum]` lookup with code that *actually* gets you the successor nodes from `datum`. — Karl Knechtel, Aug 08 '21 at 21:27
I guess my psuedo code class name is not strictly accurate. Perhaps I should have called it `GraphNodeTraverser`? I was mostly trying to get across that I needed to actually go back and forward to see which nodes were available. — danieltalsky, Aug 08 '21 at 21:29
Sure, but the point is that trying to put that logic in a separate class - and have it "keep track of" a "current" Node - is making the problem needlessly more difficult. The entire point of writing a recursive algorithm for a graph structure is that you always know what the "current node" is, without any need for an interface to "follow this edge" (that happens automatically when you make the recursive call) or "go back one step" (that happens automatically when the recursive call returns). — Karl Knechtel, Aug 08 '21 at 21:31
Yes, but in this case it's a kind of "Choose Your Own Adventure" style web application. If you've gotten to the end of a path, you can't just rely on recursion to restore the web page to its previous state. You actually have to HIT a back button to get back to a previous node. — danieltalsky, Aug 08 '21 at 21:33

score 1 · Accepted Answer · answered Aug 08 '21 at 21:01

You do not currently expose a method to get any node other than the first node, so I'm assuming you're looking for the edges only?

I've written a generator, because I've found that using lists in recursive functions tends to lead to bugs where lists are mutated when they shouldn't be or vice versa. For the same reason, I've used tuples instead of lists for sequences of edges.

def dfs(g: TheGraphToTraverse, current_path=()):
    '''Depth first search on an unbuilt graph. ``current_path`` is a tuple of edges, forming a single traversal path.

    Returns an iterator of tuples of edges.
    '''
    if g.is_this_node_an_end_node():
        yield current_path
    else:
        edges = g.get_the_list_of_edges_from_the_current_node()
        for edge in edges:
            # A "sandwich" of graph traversal management with the recursion in between.
            # This ensures that every follow_this_edge is paired with a go_back_one_step.
            g.follow_this_edge(edge)
            yield from dfs(g, current_path + (edge,))
            g.go_back_one_step()

g = TheGraphToTraverse()
list_of_all_paths = list(dfs(g))  # type: list[tuple[edge_type, ...]]

I am honestly stunned that the answer is this simple (and quite pythonic). I wasn't familiar with the `yield from` and had to read up on it. Please allow me to try and implement this before accepting the answer but I definitely will. — danieltalsky, Aug 08 '21 at 21:26
Also, just... thank you SO much. I have asked this question in different places, including a few here on SO, slowly refining it to get the idea across and you're clearly the closest to any kind of real help so I am filled with gratitude. — danieltalsky, Aug 08 '21 at 21:34

score 1 · Answer 2 · answered Aug 08 '21 at 21:50

Based on your description of the problem and the comment discussion, I am pretty confident that you only need the following abstractions:

def get_edges_from(node):
    # examine the current node to determine where we can go.
    # returns a list of graph edges to follow.
    # If the logic for following an edge requires the current node,
    # then include that information in the 'edge' representation.

def get_node(edge):
    # follow an edge and return the node at the other end.

Let's look again at the original code:

def dfs(data, path, paths = []):   
    datum = path[-1]              
    if datum in data:
        for val in data[datum]:
            new_path = path + [val]
            paths = dfs(data, new_path, paths)
    else:
        paths += [path]
    return paths

Here, data is a representation of the graph, which allows for an easy way to look up the successor nodes directly: data[datum]. datum is the "current" node in our recursion (we don't need any tools to track that externally, because it's implicit in the recursion process; and we don't need a way to "go back" be cause recursion handles that too). We've abstracted that process into global functions that operate on a node, so we don't need to pass that in. Instead, we just replace the lookup.

Separately, we don't need a special way to tell us whether the node is a leaf; we just check whether we actually recursed from there. The existing code checks if datum in data: to do this (which would cause a bug if the structure contained an empty list for successor nodes). We will instead use flag logic locally.

Finally, I want to clean up the logic that returns the paths value, because it's already being mutated in order to store the overall result. I've also removed the default value because mutable default values are bad.

Applying those changes, along with some cosmetic name changes, gives:

def dfs(path, paths):   
    current_node = path[-1]
    is_leaf = True
    for edge in get_edges_from(current_node):
        new_path = path + [get_node(edge)]
        dfs(data, new_path, paths)
        is_leaf = False
    if is_leaf:
        paths.append(path)

Then we can add a helper to start the recursion:

def dfs_from(root):
    paths = []
    dfs([root], paths) # populates the `paths` list
    return paths

It is, of course, possible to improve this. The current algorithm doesn't cache results for joins: that is, if there are multiple ways to reach a node in the middle of the graph, the algorithm will repeat the traversal of every path forward from that node, for each path leading up to it. If I were starting from scratch, I also would definitely take a recursive generator approach, rather than building up a list passed around in the recursion. I just wanted to show how little modification is actually required.

On the other hand, the recursive generator approach might make it harder to implement the optimization I hinted at. You couldn't just `functools.lru_cache` results, for example, because the cache would store generator objects that get exhausted the first time they're iterated over. — Karl Knechtel, Aug 08 '21 at 21:58
Also, thank you so much for taking the time to explain what I was missing. I really appreciate it. — danieltalsky, Aug 09 '21 at 02:03

How do I enumerate all paths in a Directed Acyclic Graph structure using Python, when the data structure can only be discovered via traversal?

2 Answers2