I have a text file with each line indicating an edge on a graph, for example

2 5 1

indicates an edge of weight 1 between nodes 2 and 5. I want to create a sparse adjacency matrix using these tuples. Typically, I'd initialize a sparse matrix as

G = scipy.sparse.lil_matrix((n,n))

where n is the number of nodes in the graph. But in this case, I do not know what 'n' is. Is there a more efficient way to create the matrix than looping over the lines of the file to find the max node index, creating the lil_matrix, and then looping over the file again? My current implementation is this:

import numpy as np
import scipy.sparse as sp

n = 0
with open(gfile) as f:
    for line in f:
        temp = list(map(int, line.split()))  # list() needed on Python 3
        n = max(n, temp[0], temp[1])
G = sp.lil_matrix((n + 1, n + 1))  # n+1 so the largest 0-based index fits
with open(gfile) as f:
    for line in f:
        temp = list(map(int, line.split()))
        G[temp[0], temp[1]] = temp[2]
NSR

1 Answer

The original, and still prototypical, way of creating a sparse matrix is to collect all inputs in row, col, data arrays (or lists), and use coo_matrix to construct the matrix. Shape can be deduced from those inputs (maximum index values), or given as a parameter.
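A minimal sketch of both shape options, with made-up edge triples:

```python
import numpy as np
from scipy import sparse

# Hypothetical (row, col, weight) triples
row = [2, 0, 3]
col = [5, 1, 3]
data = [1, 4, 2]

# Shape deduced from the inputs: (max(row)+1, max(col)+1) -> (4, 6) here
G = sparse.coo_matrix((data, (row, col)))
print(G.shape)  # (4, 6)

# Or pass shape explicitly when n is known (or larger than any index seen)
G2 = sparse.coo_matrix((data, (row, col)), shape=(10, 10))
```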

To adapt your code:

from scipy import sparse

row, col, data = [], [], []
with open(gfile) as f:
    for line in f:
        temp = list(map(int, line.split()))  # list() needed on Python 3
        # G[temp[0], temp[1]] = temp[2]
        data.append(temp[2])
        row.append(temp[0])
        col.append(temp[1])
G = sparse.coo_matrix((data, (row, col)))

List appends are at least as fast as line reads, and better than sparse matrix inserts, even lil (lil assignment involves list appends as well).

I suspect you could also do:

A = np.genfromtxt(gfile, dtype=int)  # default delimiter is whitespace
# A should now be a 2d, 3-column array
G = sparse.coo_matrix((A[:, 2], (A[:, 0], A[:, 1])))

That is, read the whole file with genfromtxt or loadtxt and create the sparse matrix from the resulting columns.
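Whichever way you build it, note that a coo matrix converts cheaply to other formats once constructed; a small sketch, assuming G was built as above:

```python
import numpy as np
from scipy import sparse

# Assume G was assembled as a coo_matrix from row/col/data arrays
row = np.array([2, 0])
col = np.array([5, 1])
data = np.array([1, 4])
G = sparse.coo_matrix((data, (row, col)))

G_lil = G.tolil()  # supports item assignment, like the original lil code
G_csr = G.tocsr()  # efficient row slicing and matrix products
```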

(When I made sparse matrices in MATLAB years ago, I used this sort of data, col, row initialization, though with a clever use of indexing to assemble those arrays from finite element blocks without loops.)

hpaulj
  • thanks! On your loadtxt comment: I think that loads the entire txt file, which i wanted to avoid, since some of these files can be pretty large. – NSR Aug 08 '16 at 16:47
  • New `genfromtxt` takes a `max_rows` parameter. And all versions let you feed lines via your own filter. In any case, `genfromtxt` isn't any faster than doing your own line read and parse. It's doing the same thing. – hpaulj Aug 08 '16 at 17:33