0

I've been using the apgl 0.8.1. library to analyse a massive network (40 million of nodes). I tried apgl because it works fine with scipy sparse matrices, and now I can load a sparse matrix into memory to perform analyses. I'm interested in obtaining a subgraph with a desidered size of the all network, after a Breadth First Search Analysis.

I'm reading an adjancency list from pandas to build a sparse matrix. Consider this sample network (Node,Node,Weight) called network:

1 5 1
5 1 1
1 2 1
5 6 1
6 7 1
7 5 1
5 2 1
2 3 1
3 4 1
4 2 1 
3 8 1
9 10 1
1 11 1
11 12 1
12 13 1
13 1 1
5 14 1

This is the sample code I'm using:

# Import Modules
import pandas as pd
import numpy as np
import scipy as sp
from apgl.graph import SparseGraph

# Load Network
df = pd.read_csv(network,sep='\s+',header=None,names=['User1','User2','W'])

# Convert the numpy array from pandas into a NxN square matrix
# Read Numpy array from Pandas
arr = df.values

# Set matrix shapes
shapes = tuple(arr.max(axis=0)[:2]+1)

# Build a Sparse Matrix
matrix = sp.sparse.csr_matrix((arr[:, 2], (arr[:, 0], arr[:, 1])), 
                        shape=shape,
                        dtype=arr.dtype)

# Set the total number of nodes
numVertices = shape[0]

# Inizialize Graph
graph = SparseGraph(numVertices, undirected=False, W=matrix, frmt='csr')

# Perform BFS starting from one one --> set output to np.array
startingnode = 5
bfs = np.array(graph.breadthFirstSearch(startingnode))

# Return SubGraph with a list of nodes
# Set limit
limit = 5
subgraph = graph.subgraph(bfs[:limit])

which returns:

bfs = [ 5  1  2  6 14 11  3  7 12  4  8 13]
subgraph = SparseGraph: vertices 5, edges 6, directed, vertex storage GeneralVertexList, edge storage <class 'scipy.sparse.csr.csr_matrix'>

So I set up a limit of 5 nodes of the resulting subgraph. But the nodes are chosen from the first to the fifth, no matter the shape of the search. I mean, the Breadth First Search algorithm is looking for neighbours of neighbours and so on, and I would like to set a limit which includes the search in the all final neighbours level of the last node chosen. In the example, the subgraph contains the first five nodes of the BFS array, so:

subgraph = [5 1 2 6 14]

but I would like to include also nodes 7 (which complete the neighbours level starting from node 5) and 3 (which complete the search in the level of node 2). So the resulting array of nodes should be:

subgraph = [5 1 2 3 6 7 14]

Any help would be appreciated.

EDIT: What I would like to do is to find a subgraph of the entire graph starting from a random node, and performing a BFS algorithm till different size of the subgraph, e.g. 5 million, 10 million and 20 million nodes. I would like to complete each level of neighbours of nodes before stopping, so it doesn't matter if the number of nodes is 5 million or 5 million and 100, if the last 100 nodes are needed to complete a level of all neighbours of the last node found in the search.

Fabio Lamanna
  • 20,504
  • 24
  • 90
  • 122
  • Can you please provide some more information about what it is exactly you are trying to discover in the network? Also, have you considered Networkx? (https://networkx.github.io/) – A_A Apr 10 '15 at 09:56
  • Yes, but since the network is so big, Networkx can't handle those amount of information, even with 200 (two hundred) GB of RAM. I'm editing the answer with more information thanks. – Fabio Lamanna Apr 10 '15 at 12:36

1 Answers1

1

BFS and DFS operate over undirected graphs which could be acyclic (i.e. trees) or not.

During their operation, both algorithms may encounter back-edges. That is, edges that if traversed would bring the algorithm to consider nodes already visited.

These edges and the nodes at their ends are not (normally) reported in the output of BFS or DFS.

Networkx (https://networkx.github.io/) contains dfs_labeled_edges which returns an iterator with a characteristic for each edge.

As far as your code is concerned:

limit = 5
subgraph = graph.subgraph(bfs[:limit])

This is not going to search for all BFS subgraphs of 'length' 5. Because of bfs[:5] this will always be looking at the first 5 entries of the BFS' output.

If you are looking for cycles, perhaps you can use a different algorithm (For example, this is again from Networkx), or extract your subgraph from the original network and then use DFS to label its edges, or enumerate all of its simplest paths and try to work on them.

Hope this helps.

Supplementary Information:

(I am also taking this older question of yours into account here: https://stackoverflow.com/questions/29169022/scipy-depth-breadth-first-search-tree-size-limit)

You want to extract a proper subgraph U(W,C) from your main graph G(V,E) with the order of U being much smaller than the order of G (|U|<<|G|) and furthermore with U being the BFS of G starting on some node v_i of G(V,E).

There are two ways you can do this:

  1. Write your own BFS where you can add counters for the depth of traversal and nodes traversed and use them to interrupt the algorithm wherever you like. Due to the extremely large number of nodes that you have, you should look into the iterative rather than the recursive version of the algorithm. For more information, please see: Way to go from recursion to iteration and also to some extent http://en.wikipedia.org/wiki/Depth-limited_search . This approach will be more efficient because the BFS would not have to go through the whole graph.

  2. Truncate the output of an existing BFS and use the remainder nodes as your starting points for the next step.

In either case, your algorithm will contain one more iteration step and will end up looking something like this (here for option #2):

#Given graph G(V,E) and NNodeLimit (Natural number)
#Produce a set Q of BFS proper subgraphs bearing the characteristics of U.
Q = []
nextBFSNode = [0]

while len(nextBFSNode):
    #Pop the node
    startingPoint = nextBFSNode.pop()
    #Build a BFS graph starting from some node
    q = BFS(G, startingPoint)
    #Truncate its output and save it to the list.
    Q.append(subgraph(G,q[:NNodeLimit]))
    #Add the remaining nodes as future starting points
    nextBFSNode = set(q[NNodeLimit+1:])

There will of course be considerable overlap between different trees.

Hope this helps.

Community
  • 1
  • 1
A_A
  • 2,326
  • 18
  • 27
  • Thanks again. What I would like to explain is that my bfs array is a good subgraph performed with the BFS algorithm. But in my real network this can be really really big, so I'd like to extract from that only a smaller portion, but keeping, if possible, all the latest level of neighbours in the last node visited. – Fabio Lamanna Apr 10 '15 at 12:46
  • Finally, I tried to work with Networkx, but I had to move to scipy sparse matrices because of the size in memory of the graph in Networkx. Working with sparse matrices is easier, but difficult as well to perform those kind of algorithms. – Fabio Lamanna Apr 10 '15 at 12:49
  • Thank you for clarifying, have amended original response with some more information. I must say that I still do not see what exactly you are trying to do or why you are trying to do it this way. In any case, you might also want to have a look at: http://stackoverflow.com/questions/7978840/what-scalability-issues-are-associated-with-networkx – A_A Apr 10 '15 at 14:00
  • Thanks again for your advises. You have understood well my problem. After extracting the subgraphs I will have to perform some clusters analysis on them, so I need different samples subgraphs in order to compute statistical analysis. – Fabio Lamanna Apr 10 '15 at 15:41
  • You are welcome, if you find the answer actually working for you, please consider accepting it for this question too. – A_A Apr 11 '15 at 08:49