I've been using the apgl 0.8.1. library to analyse a massive network (40 million of nodes). I tried apgl because it works fine with scipy sparse matrices, and now I can load a sparse matrix into memory to perform analyses. I'm interested in obtaining a subgraph with a desidered size of the all network, after a Breadth First Search Analysis.
I'm reading an adjancency list from pandas to build a sparse matrix. Consider this sample network (Node,Node,Weight) called network
:
1 5 1
5 1 1
1 2 1
5 6 1
6 7 1
7 5 1
5 2 1
2 3 1
3 4 1
4 2 1
3 8 1
9 10 1
1 11 1
11 12 1
12 13 1
13 1 1
5 14 1
This is the sample code I'm using:
# Import Modules
import pandas as pd
import numpy as np
import scipy as sp
from apgl.graph import SparseGraph
# Load Network
df = pd.read_csv(network,sep='\s+',header=None,names=['User1','User2','W'])
# Convert the numpy array from pandas into a NxN square matrix
# Read Numpy array from Pandas
arr = df.values
# Set matrix shapes
shapes = tuple(arr.max(axis=0)[:2]+1)
# Build a Sparse Matrix
matrix = sp.sparse.csr_matrix((arr[:, 2], (arr[:, 0], arr[:, 1])),
shape=shape,
dtype=arr.dtype)
# Set the total number of nodes
numVertices = shape[0]
# Inizialize Graph
graph = SparseGraph(numVertices, undirected=False, W=matrix, frmt='csr')
# Perform BFS starting from one one --> set output to np.array
startingnode = 5
bfs = np.array(graph.breadthFirstSearch(startingnode))
# Return SubGraph with a list of nodes
# Set limit
limit = 5
subgraph = graph.subgraph(bfs[:limit])
which returns:
bfs = [ 5 1 2 6 14 11 3 7 12 4 8 13]
subgraph = SparseGraph: vertices 5, edges 6, directed, vertex storage GeneralVertexList, edge storage <class 'scipy.sparse.csr.csr_matrix'>
So I set up a limit of 5 nodes of the resulting subgraph. But the nodes are chosen from the first to the fifth, no matter the shape of the search. I mean, the Breadth First Search algorithm is looking for neighbours of neighbours and so on, and I would like to set a limit which includes the search in the all final neighbours level of the last node chosen. In the example, the subgraph contains the first five nodes of the BFS array, so:
subgraph = [5 1 2 6 14]
but I would like to include also nodes 7 (which complete the neighbours level starting from node 5) and 3 (which complete the search in the level of node 2). So the resulting array of nodes should be:
subgraph = [5 1 2 3 6 7 14]
Any help would be appreciated.
EDIT: What I would like to do is to find a subgraph of the entire graph starting from a random node, and performing a BFS algorithm till different size of the subgraph, e.g. 5 million, 10 million and 20 million nodes. I would like to complete each level of neighbours of nodes before stopping, so it doesn't matter if the number of nodes is 5 million or 5 million and 100, if the last 100 nodes are needed to complete a level of all neighbours of the last node found in the search.