How to obtain all pairs shortest path length for large graph with 100k(100,000) nodes?

Question

Shortest path with 10,0000 nodes, 5 billion edges.

Suppose matrix $A$ is the symmetric weighted ajacency matrix (Undirected Graph), and element $A_{ij}$ represents the weight of edge $e_{ij}$, suppose the matrix size is $100000\times 100000$, so here are $n(n-1)/2=4999950000$ edges, essentially 5 billion edges, I wish to obtain two matrixes $L$ and $W$:

Let $P_{ij}=\left{v_1, v_2, \cdots, v_n\right}$ represents one of the shortest path with $n$ nodes from $i$ to $j$ (or $j$ to $i$ due to symmetric), hence we can calculate the following two quanties [L_{ij} = n-1, W_{ij} = \sum_{i=1}^{n-1}A_{v_i,v_{i+1}}]

(1) The Matrix $L$ of weighted shortest path length, whiere $L_{ij}$ represents the length of shortest from node $i$ to node $j$.

(2) The matrix $W$ of shortest path cumulative edge weights, where $W_{ij}$ represents the cumulative edge weight along the path.

Here I got that for all pair shortest path, there are algorithms like Floyd Warshall, Johnson algorithms, but it would require hundreds of days to calculate all pairs path length, what is the possible suitable solution for this?

I tried Python igraph with python code

import tqdm
import numpy as np
import pandas as pd
import igraph as ig

def all_pairs_shortest_length(ajacency_matrix):
    m, n = ajacency_matrix.shape # ajacency_matrix is weighted
    L = np.zeros((m,n), dtype=int) # L is symmetric
    G = ig.Graph.Weighted_Adjacency(ajacency_matrix, mode = 'undirected') # 340G memory to construct graph with 65000 nodes
    for v in tqdm.trange(m):
        # v as start node, only calculate target nodes  from v+1 to m
        paths_rest = ig.GraphBase.get_all_shortest_paths(G, v=v, to = range(v+1, m), weights=G.es["weight"], mode="out")
        target_with_length = pd.DataFrame([[path[-1],len(path)-1] for path in paths_rest], columns=['target','length']).drop_duplicates(ignore_index=True)
        # for weighted shortest path, choose the path with minimum number of edges
        L[v,v+1:] = target_with_length.groupby('target', group_keys=True).apply(min).length
    L = L + L.T
    M = np.array(ig.GraphBase.distances(G, weights=G.es["weight"]))
    return L, M

if __name__=='__main__':
    test_N = 100000
    b = np.random.random_integers(1,100, size=(test_N, test_N))
    ajacency_matrix = (b + b.T)/2
    np.fill_diagonal(ajacency_matrix, 0)
    L, M = all_pairs_shortest_length(ajacency_matrix)

You seem to expect $ to be some recognized markup, which it is not. Beyond that, I'm afraid, that you note, that O(n³) complexity see [here](https://stackoverflow.com/q/10779054/1435475) becomes really huge for large n. — guidot, Mar 31 '23 at 07:00
Yep, sorry for the typo, my first time to ask a question, here the reason I calculate L, is that I hope to see the path length distribution form a group of nodes to another group of nodes, the data that I am actually processing contains 65000 nodes. So for this number of nodes, it is not possible to obtian the result? — Chibing.Xiang, Mar 31 '23 at 07:11
"10,0000 nodes, 5 billion edges." Are you sure? 5,000,000,000 / 10,000 = 500,000. It is an unusual graph where the average vertex has half a million edges. What is your graph modeling? — ravenspoint, Mar 31 '23 at 14:39

ravenspoint · Answer 1 · 2023-03-31T15:49:32.163

The Dijkstra algorithm is the fastest way to find the shortest paths between nodes. One run of the algorithm finds the shortest paths from one node to every other node. To find the shortest path between every pair of nodes will require many runs on a directed graph. An undirected graph will require fewer runs since the shortest path between u and v is just the reverse of the path between v and u. You do not tell us if your graph is directed or not.

In any case, this does take a long time for large graphs, though if your implementation of Dijkstra is well coded you will need minutes rather than days.

Getting reasonable performance from code written in Python will be a challenge. If you care about performance, you should consider a compiled coding language such as C++ which can give up to 50 times faster run times than the same algorithm coded in Python.

Here is some C++ code implementing the Dijkstra algorithm.

void dijsktra(
    const cGraph &g,
    int start,
    std::vector<double> &dist,
    std::vector<int> &pred)
{
    // shortest distance from start to each node
    dist.clear();
    dist.resize(g.vertexCount(), INT_MAX);

    // previous node on shortest path to each node
    pred.clear();
    pred.resize(g.vertexCount(), -1);

    std::vector<bool> sptSet(g.vertexCount(), false); // sptSet[i] will be true if vertex i is included in shortest
                                                      // path tree or shortest distance from src to i is finalized

    // Distance of source vertex from itself is always 0
    dist[start] = 0;
    pred[start] = 0;

    // Find shortest path for all vertices
    for (int count = 0; count < g.vertexCount() - 1; count++)
    {
        // Pick the minimum distance vertex from the set of vertices not
        // yet processed. u is always equal to src in the first iteration.
        int min = INT_MAX, uidx;
        for (int vidx = 0; vidx < g.vertexCount(); vidx++)
            if (sptSet[vidx] == false && dist[vidx] <= min)
            {
                min = dist[vidx];
                uidx = vidx;
            }
        if (min == INT_MAX)
        {
            // no more nodes connected to start
            break;
        }

        // Mark the picked vertex as processed
        sptSet[uidx] = true;

        // Update dist value of the adjacent vertices of the picked vertex.
        for (auto vp : g.adjacentOut(uidx))
        {
            if (sptSet[vp])
                continue; // already processed

            // Update dist[v] only if total weight of path from src to  v through u is
            // smaller than current value of dist[v]
            double cost = atof(g.rEdgeAttr(g.find(uidx, vp), 0).c_str());
            if (dist[uidx] + cost < dist[vp])
            {
                dist[vp] = dist[uidx] + cost;
                pred[vp] = uidx;
            }
        }
    }
}

You can find the code for the complete path finding application at https://github.com/JamesBremner/PathFinder

Run time for running the Dijkstra algorithm to find the cheapest paths from one randomly selected vertex to every other vertex on large graphs downloaded from https://dyngraphlab.github.io/. Longest result from 3 runs using the graphex application.

Vertex Count	Edge Count	Run time ( secs )
10,000	50,000	0.3
145,000	2,200,000	40

Here is an introduction the Dijkstra algorithm https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm

well, thanks for your answer and suggestions, yes, suppose it take 1 min (actually round 10 mins) for one round to calculate one node a to the rest n-1 nodes , and to calcculate n(n-1)/2 pairs, that is equally n/2 rounds (actually n-1 rounds), hence the minimum time is n/2 minutes, here n=100000, 50000 minutes make it 50000/60=833.3 hours, and 50000/60/24=35 days, that's what I am refering to. And again, thank you for yoru detailed infromation very much, it seems your suggestion with c++ would be the best option, I might consider a selected group of pairs rather than all pairs. — Chibing.Xiang, Apr 01 '23 at 07:16
It requires 0.3 seconds to calculate the path from one node to every other in a graph with 10,000 vertices. — ravenspoint, Apr 01 '23 at 17:28

How to obtain all pairs shortest path length for large graph with 100k(100,000) nodes?

1 Answers1