
I have a DAG that looks like this: [image: Example DAG]

I want to extract all paths consisting of 4 nodes from this graph.

My expected result should look like this:

N1 -> N2 -> N3 -> N4

N1 -> N2 -> N3 -> N5

N1 -> N3 -> N4 -> N5

N2 -> N3 -> N4 -> N5

My current attempt looks like this:

def path_finder(n1):
    paths = []
    if DAG.has_node(n1):
        for n2 in DAG.successors(n1):
            for n3 in DAG.successors(n2):
                for n4 in DAG.successors(n3):
                    paths.append([n1, n2, n3, n4])
    return paths

I'm calling this function for each node. `DAG` is a global variable; more specifically, it is a networkx object (`DAG = networkx.DiGraph()`). This naive function is pathetically slow. Is there a more efficient strategy to do this?

I have looked at question 20262712, but it was self-solved by its author in a rather obscure way.

Thanks

UPDATE:

Since I couldn't find a satisfactory algorithm, I ended up parallelizing the job, using my naive function as a worker and dumping all the data into a queue. I used `pool.imap_unordered` to launch the worker function and aggregated the results from the queue. It is still slow (it takes a couple of hours for 5M nodes). I should also provide the average degree of the nodes I'm dealing with, because that affects how fast the workers run, but I'll leave that out for now.
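For reference, a minimal sketch of that parallel setup, assuming a fork-based start method so the workers inherit the global `DAG` (the pool size and chunksize are illustrative guesses; `imap_unordered` manages the result queue internally):

import multiprocessing as mp

if __name__ == '__main__':
    all_paths = []
    # each worker process runs the naive traversal for one start node;
    # results arrive in whatever order workers finish
    with mp.Pool(processes=8) as pool:        # pool size is an assumption
        for paths in pool.imap_unordered(path_finder, list(DAG.nodes()),
                                         chunksize=1000):
            all_paths.extend(paths)

Aggregating everything into one giant list is itself costly; writing each batch straight to disk as it arrives may be cheaper.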

  • Note - the backtracking described in the answer to the question you linked is basically taking advantage of the fact that once you've calculated all paths from a node, you don't need to do that again if you come across the node again (if you've saved that data). My answer uses this in a different way. – Joel Aug 27 '16 at 20:14
  • Can you say a bit about what you need this for? Are you sure you need the list rather than, say, a generator? – Joel Aug 27 '16 at 20:26
  • This is part of a larger algorithm that I'm trying to develop for finding specific repetitive sequences in the human genome (which is basically a large string composed of four letters A, T, G, C). Each node here marks the position of a specific repeat, and edges their distances; nodes are connected only when their distance is less than a defined value. Now I want to identify blocks of these repeats, as they can be meaningful in any combination of four repeats. – Parashar Aug 27 '16 at 20:37
  • I would like to dump all the paths into an HDF5 file. I expect this will not be a fast process, given that I might have up to 100M nodes. Hence I need to dump only after all the costly graph traversal. – Parashar Aug 27 '16 at 20:39
  • I'm not seeing a good solution. You should check it, but I suspect your run time is dominated by `paths.append([n1, n2, n3, n4])`. If so, there's not much you'll be able to do about it. – Joel Aug 27 '16 at 20:52
  • Joel, you are right there: appending to the list does cost me a lot. On a 10K-node graph, searching paths takes 16 seconds with `append`, while if I replace that statement with `pass`, runtime comes down to 3.5 seconds. I'm thinking of using queues here. Any high-performance Python library you can suggest for this? – Parashar Aug 27 '16 at 21:01
  • Did you try to run it with other Python implementations like [PyPy](http://pypy.org/)? For simple code the speedup is usually a factor of a few. – Ante Sep 06 '16 at 08:43

3 Answers


Here is a function that returns the paths of a given length between all nodes in the graph. It iterates over all ordered pairs of nodes and uses `networkx.all_simple_paths` to get the paths.

import networkx as nx

g = nx.DiGraph()

g.add_nodes_from(['A', 'B', 'C', 'D', 'E'])

# add_path was a graph method in networkx 1.x; in 2.x it is a module function
nx.add_path(g, ['A', 'B', 'C', 'D'])
nx.add_path(g, ['A', 'B', 'C', 'E'])
nx.add_path(g, ['A', 'C', 'D', 'E'])
nx.add_path(g, ['B', 'C', 'D', 'E'])

def find_paths(graph, number_nodes=4):
    paths = []
    for source in graph.nodes():
        for target in graph.nodes():
            if source != target:
                p_source_target = nx.all_simple_paths(graph,
                                                      source,
                                                      target,
                                                      cutoff=number_nodes-1)
                paths.extend(p for p in p_source_target if len(p) == number_nodes)
    return paths

find_paths(g)
# output:
[['B', 'C', 'D', 'E'],
 ['A', 'C', 'D', 'E'],
 ['A', 'B', 'C', 'E'],
 ['A', 'B', 'C', 'D']]
James
  • This will find all paths between all pairs of nodes. Then it selects the length 4 paths. You can dramatically speed this up by setting the cutoff to be 4, so it stops once a path is longer than 4. – Joel Aug 27 '16 at 20:10
  • Thanks James. I'm afraid the complexity of this approach might be O(n²) or worse, given that you are iterating over all pairs of nodes. I benchmarked your code and it is much slower than my naive strategy and its recursive counterpart suggested by Joel above. I appreciate the idea of using `all_simple_paths`; thinking how to build on it in a better way. – Parashar Aug 27 '16 at 20:24
  • To be more specific, I tried running your code on a graph with 1K nodes and it took about 4 minutes. Joel's strategy above took around 2.7 sec and mine around 3.5 sec on the same graph. – Parashar Aug 27 '16 at 20:33

Part of your problem could be that if you come across a node u as the second node in a path, you then do all of the calculations to find the paths of length 3 that continue from it. But if you come across u as the second node again, you repeat all of those calculations.

So try to avoid this. We'll do it recursively, calculating all the length-3 paths first (which in turn requires calculating the length-2 paths):

def get_paths(G, n):
    '''returns a dict, paths, such that paths[u] is a list of all paths
       of length n that start from u'''
    if n == 1:
        # base case: D[u] is a list of all length-1 paths starting from u.
        # It's a boring list.
        return {u: [[u]] for u in G.nodes()}
    # if we get here, n > 1 (unless the input was bad)
    subpath_dict = get_paths(G, n-1)  # all length n-1 paths, indexed by first node
    path_dict = {}
    for u in G:
        path_dict[u] = []
        for v in G.successors(u):
            path_dict[u].extend([u] + subpath for subpath in subpath_dict[v])
    return path_dict

G = nx.DiGraph()
nx.add_path(G, [1, 2, 3, 4, 5, 6])
nx.add_path(G, [1, 3, 6, 8, 10])

path_dict = get_paths(G,4)
path_list = []
for paths in path_dict.values():
    path_list.extend(paths)
Joel
  • Thanks Joel. Using recursion here is very thoughtful and apt. However, when benchmarking this code I couldn't find any performance gain over my naive strategy. Moreover, I have a large network with millions of nodes; I would like to track the path-search progress, and recursion makes that intractable. Can we improve this further? – Parashar Aug 27 '16 at 20:17
  • That surprised me, but looking in more detail it starts to make sense. I have to do `[u]+subpath` for each path of each successor of `u`; you're calling a for loop for each successor of `ni`. Those probably have similar costs. I'll follow up a bit more in the comments on your question for clarification, but I don't know that there's much that can be improved. – Joel Aug 27 '16 at 20:25

The number of such paths is of order |V|·d³, where d is the average out-degree of a node. From how the graph is created, d is bounded, and I suppose d is not very small (say, d ≥ 5). That means a 5M-node graph has on the order of a billion paths.

Since finding one path is fast (the paths are short), it is not certain that a DP-like algorithm can help. DP-like algorithms try to take advantage of partially calculated data, so there is an overhead of storing and retrieving that data, and that overhead may be larger than simply calculating the needed partial data on the fly.

One idea is an algorithm that traverses the DAG in reverse topological order and does two things (see the sketch below):

  • for each node, keep all 3-node paths that start from that node;
  • using the successors' 3-node paths, output all 4-node paths.

This method can use a lot of memory, but some of it can be freed for nodes that are no longer successors of any node on the traversal boundary.
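A minimal sketch of that idea (the function name is mine; for simplicity it stores suffixes of every length up to the target, keeps them all in memory, omits the freeing step, and yields rather than prints):

import networkx as nx

def paths_by_reverse_topo(DAG, length=4):
    # suffixes[u][k] = all paths of k nodes that start at u
    suffixes = {}
    # reverse topological order: every successor of u is processed before u
    for u in reversed(list(nx.topological_sort(DAG))):
        suffixes[u] = {1: [[u]]}
        for k in range(2, length + 1):
            suffixes[u][k] = [[u] + p
                              for v in DAG.successors(u)
                              for p in suffixes[v][k - 1]]
    for u in DAG:
        for p in suffixes[u][length]:
            yield p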

Another idea is simply to optimize the naive algorithm. In your solution there are three for loops for each node, which means four for loops over all paths. Note that each loop is over nodes. The first two loops can be merged by iterating over edges, since each path has to start with an edge. The algorithm becomes:

paths = []
for n1, n2 in DAG.edges():
    for n3 in DAG.successors(n2):
        for n4 in DAG.successors(n3):
            paths.append([n1, n2, n3, n4])

Or, even simpler, select the middle edge first:

import itertools

paths = []
for n2, n3 in DAG.edges():
    for n1, n4 in itertools.product(DAG.predecessors(n2), DAG.successors(n3)):
        paths.append([n1, n2, n3, n4])

The outer loop can be optimized by skipping middle edges that start at a source node (no predecessors) or end at a sink node (no successors). `product()` detects those cases quickly anyway, but skipping them may help by not sending unneeded data to another process.
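A hedged sketch of that filter, continuing the snippet above:

for n2, n3 in DAG.edges():
    # a middle edge is useless if it cannot be extended on both sides
    if DAG.in_degree(n2) == 0 or DAG.out_degree(n3) == 0:
        continue
    for n1, n4 in itertools.product(DAG.predecessors(n2), DAG.successors(n3)):
        paths.append([n1, n2, n3, n4])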

Ante