2

I have a graph with an adjacency matrix of shape (4000, 4000) (adj_mat.shape = (4000, 4000)). My current problem involves finding the list of all path lengths (the sequence of nodes is not so important) that traverse from the source (row = 0) to the target (col = adj_mat.shape[0] - 1).

I am not interested in finding the path sequences; I am only interested in propagating the path lengths. As a result, this is different from finding all simple paths, which would be too slow (i.e. find all paths from source to target, then score each path). Is there a performant way to do this?
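To illustrate why enumeration is hopeless, here is a toy construction (illustrative only, not my actual graph) where the number of simple paths grows exponentially in the node count:

```python
import networkx as nx

# Toy example: a chain of n "diamonds", where each diamond doubles the
# number of source-to-target paths, so the path count is 2**n for only
# 3*n + 1 nodes.
n = 16
G = nx.DiGraph()
for i in range(n):
    a, b1, b2, c = 3 * i, 3 * i + 1, 3 * i + 2, 3 * (i + 1)
    G.add_edges_from([(a, b1), (a, b2), (b1, c), (b2, c)])

n_paths = sum(1 for _ in nx.all_simple_paths(G, 0, 3 * n))
print(n_paths)  # 65536 paths in a 49-node graph
```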


DFS is suggested as one possible strategy (noted here). My current implementation (below) is simply not optimal:

import networkx as nx
import numpy as np

# create graph
G = nx.from_numpy_matrix(adj_mat, create_using=nx.DiGraph())

# initialize nodes
for node in G.nodes:
    G.nodes[node]['cprob'] = []

# set starting node value
G.nodes[0]['cprob'] = [0]

def propagate_prob(G, node):

    # find incoming edges to node
    predecessors = list(G.predecessors(node))
    curr_node_arr = []        

    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = G.get_edge_data(prev_node, node)['weight']

        # get predecessor node value
        if len(G.nodes[prev_node]['cprob']) == 0:                
            G.nodes[prev_node]['cprob'] = propagate_prob(G, prev_node)            
        prev_node_arr = G.nodes[prev_node]['cprob']   

        # add incoming edge weight to prev_node arr
        curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(prev_node_arr)])

    # update current node array
    G.nodes[node]['cprob'] = curr_node_arr
    return G.nodes[node]['cprob']

# calculate all path lengths from source to sink (node indices run 0..3999)
part_func = propagate_prob(G, adj_mat.shape[0] - 1)

batlike
  • I do not think you can find path lengths without finding paths. – DYZ Oct 15 '20 at 05:16
  • Maybe I misunderstood. I can imagine propagating a list of incoming edge weights at each node. The connecting edge weight + all preceding incoming edge weights delineates all possible path lengths. – batlike Oct 15 '20 at 05:17
  • But the same also builds all paths. – DYZ Oct 15 '20 at 05:33
  • I edited my question per your question about the semantics of finding paths vs finding path lengths. But is finding the path length without recording the path sequence really finding the path? I am not so sure. In HMMs for instance, the Forward algorithm need not enumerate all possible paths that emit a sequence of interest; it simply propagates the probability propagated from the previous states. – batlike Oct 15 '20 at 05:40

2 Answers

1

I don't have a large example at hand (e.g. >300 nodes), but I found a non-recursive solution:

import networkx as nx

g = nx.DiGraph()

nx.add_path(g, range(7))

g.add_edge(0, 3)
g.add_edge(0, 5)
g.add_edge(1, 4)
g.add_edge(3, 6)

# first step retrieve topological sorting
sorted_nodes = nx.algorithms.topological_sort(g)

start = 0
target = 6

path_lengths = {start: [0]}

for node in sorted_nodes:
    if node == target:
        print(path_lengths[node])
        break

    if node not in path_lengths or g.out_degree(node) == 0:
        continue
    new_path_length = path_lengths[node]
    new_path_length = [i + 1 for i in new_path_length]
    for successor in g.successors(node):
        if successor in path_lengths:
            path_lengths[successor].extend(new_path_length)
        else:
            path_lengths[successor] = new_path_length.copy()

    if node != target:
        del path_lengths[node]

Output: [2, 4, 2, 4, 4, 6]

If you are only interested in the number of paths with different length, e.g. {2:2, 4:3, 6:1} for above example, you could even reduce the lists to dicts.
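For completeness, a sketch of that dict variant (same toy graph; `Counter` is just one convenient way to keep `{length: count}`):

```python
import networkx as nx
from collections import Counter

# Same toy graph as above
g = nx.DiGraph()
nx.add_path(g, range(7))
g.add_edges_from([(0, 3), (0, 5), (1, 4), (3, 6)])

start, target = 0, 6
# per node: Counter mapping path length -> number of paths of that length
length_counts = {start: Counter({0: 1})}

for node in nx.topological_sort(g):
    if node == target:
        print(dict(length_counts[node]))  # {2: 2, 4: 3, 6: 1}
        break
    if node not in length_counts:
        continue  # not reachable from the start node
    # step every known length by one edge
    stepped = Counter({l + 1: c for l, c in length_counts[node].items()})
    for successor in g.successors(node):
        length_counts.setdefault(successor, Counter()).update(stepped)
    del length_counts[node]  # free memory for processed nodes
```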

Background

Some explanation of what I'm doing (which I hope works for larger examples as well). The first step is to retrieve a topological sorting. Why? Because then I know in which "direction" the edges flow, and I can simply process the nodes in that order without "missing any edge" or any "backtracking" like in a recursive variant. Afterwards, I initialise the start node with a list containing the current path length ([0]). This list is copied to all successors, while updating the path length (all elements +1). The goal is that in each iteration the path lengths from the starting node to all processed nodes are calculated and stored in the dict path_lengths. The loop stops after reaching the target node.
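The same pass also works with weighted edges (as in the original question) by accumulating the edge weight instead of +1. A small sketch with made-up weights:

```python
import networkx as nx

# Tiny weighted DAG (weights are illustrative): two paths 0->3,
# both of total weight 2.5
g = nx.DiGraph()
g.add_weighted_edges_from([(0, 1, 0.5), (0, 2, 1.5), (1, 3, 2.0), (2, 3, 1.0)])

start, target = 0, 3
path_lengths = {start: [0.0]}

for node in nx.topological_sort(g):
    if node == target:
        print(sorted(path_lengths[node]))  # [2.5, 2.5]
        break
    if node not in path_lengths:
        continue
    for successor in g.successors(node):
        w = g.edges[node, successor]['weight']
        stepped = [l + w for l in path_lengths[node]]
        path_lengths.setdefault(successor, []).extend(stepped)
    del path_lengths[node]
```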

Sparky05
  • Yes, the topological ordering is important - you are completely right. Sorting it into the order requires O(V+E) time. I will give it a shot on the larger graphs to test performance. – batlike Oct 16 '20 at 14:54
  • I gave it a shot but it flags after 200 nodes. I suspect it has more to do with the overhead for each networkx call more than anything. I hope you don't mind, I upvoted your response but I'm going to leave the question open. – batlike Oct 23 '20 at 05:24
  • With flags you mean, it needs too much memory or that the processing of a single node takes too long? Have you tried the `dict` alternative (if that would fit your requirements)? – Sparky05 Oct 23 '20 at 08:49
  • Yes the latter - even without recursion a single node computation takes too long with Networkx. I think maybe with LightGraphs (in Julia?) it might be faster. I will try your dictionary approach, thank you for the suggestion. – batlike Oct 23 '20 at 15:57
0

With igraph I can calculate up to 300 nodes in ~1 second. I also found that accessing the adjacency matrix directly (rather than calling igraph functions to retrieve edges/vertices) saves time. The two key bottlenecks are 1) appending to a long list efficiently (while also keeping it in memory) and 2) finding a way to parallelize. The runtime grows exponentially past ~300 nodes; I would love to see if someone has a faster solution (that also fits into memory).

import igraph
import numpy as np

# create graph from adjacency matrix
G = igraph.Graph.Adjacency((trans_mat_pad > 0).tolist())

# add edge weights
G.es['weight'] = trans_mat_pad[trans_mat_pad.nonzero()]

# initialize nodes
for node in range(trans_mat_pad.shape[0]):
    G.vs[node]['cprob'] = []

# set starting node value
G.vs[0]['cprob'] = [0]

def propagate_prob(G, node, trans_mat_pad):

    # find incoming edges to node
    predecessors = trans_mat_pad[:, node].nonzero()[0] # G.get_adjlist(mode='IN')[node]
    curr_node_arr = []        

    for prev_node in predecessors:
        # get incoming edge weight
        edge_weight = trans_mat_pad[prev_node, node] # G.es[prev_node]['weight']

        # get predecessor node value
        if len(G.vs[prev_node]['cprob']) == 0:
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + propagate_prob(G, prev_node, trans_mat_pad)])
        else: 
            curr_node_arr = np.concatenate([curr_node_arr, np.array(edge_weight) + np.array(G.vs[prev_node]['cprob'])])
    ## NB: If memory constraint, uncomment below
    # set max size
    # if len(curr_node_arr) > 100:
    #     curr_node_arr = np.sort(curr_node_arr)[:100]
    
    # update current node array
    G.vs[node]['cprob'] = curr_node_arr
    return G.vs[node]['cprob']

# calculate path lengths
path_len = propagate_prob(G, trans_mat_pad.shape[0]-1, trans_mat_pad)
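For reference, the same propagation can also be done without recursion by processing nodes in topological order directly on the matrix (a sketch of my own; it assumes the graph is a DAG and uses Kahn's algorithm in plain numpy instead of igraph calls):

```python
import numpy as np

def all_path_lengths(trans_mat_pad):
    """All source->sink path lengths in one pass over a topological order.
    Assumes trans_mat_pad is a dense weighted DAG adjacency matrix,
    source = row 0, sink = last row."""
    n = trans_mat_pad.shape[0]
    adj = trans_mat_pad > 0

    # Kahn's algorithm: repeatedly emit nodes with no remaining predecessors
    indegree = adj.sum(axis=0)
    queue = [v for v in range(n) if indegree[v] == 0]
    cprob = {0: np.array([0.0])}
    target = n - 1
    while queue:
        node = queue.pop()
        if node == target:
            break  # all predecessors of the sink have been processed
        for succ in np.nonzero(adj[node])[0]:
            if node in cprob:  # skip nodes unreachable from the source
                arr = cprob[node] + trans_mat_pad[node, succ]
                cprob[succ] = np.concatenate([cprob.get(succ, np.empty(0)), arr])
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)
        cprob.pop(node, None)  # free memory for processed nodes
    return cprob.get(target, np.empty(0))
```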
batlike