
I have a big CSV file which lists connections between nodes in a graph, for example:

0001,95784
0001,98743
0002,00082
0002,00091

So this means that node id 0001 is connected to nodes 95784 and 98743, and so on. I need to read this into a sparse matrix in numpy. How can I do this? I am new to Python, so pointers to tutorials would also help.

– Iterator, Ankur Chauhan

3 Answers


Example using scipy's lil_matrix (list-of-lists sparse matrix). From the scipy documentation:

Row-based linked list matrix.

This contains a list (self.rows) of rows, each of which is a sorted list of column indices of non-zero elements. It also contains a list (self.data) of lists of these elements.

$ cat 1938894-simplified.csv
0,32
1,21
1,23
1,32
2,23
2,53
2,82
3,82
4,46
5,75
7,86
8,28

Code:

#!/usr/bin/env python

import csv
from scipy import sparse

rows, columns = 10, 100
matrix = sparse.lil_matrix((rows, columns))

# assign through the matrix interface so that matrix.rows and
# matrix.data stay consistent (appending to matrix.data alone
# would leave the matrix in a corrupt state)
with open('1938894-simplified.csv') as f:
    for line in csv.reader(f):
        row, column = map(int, line)
        matrix[row, column] = 1

print(matrix.rows.tolist())

Output:

[[32], [21, 23, 32], [23, 53, 82], [82], [46], [75], [], [86], [28], []]
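As a follow-up sketch (mine, not part of the original answer): lil_matrix is convenient for incremental construction, but converting to CSR is the usual next step before doing arithmetic or row slicing:

```python
from scipy import sparse

# build a tiny lil_matrix incrementally, then convert
matrix = sparse.lil_matrix((3, 4))
matrix[0, 1] = 1
matrix[2, 3] = 1

csr = matrix.tocsr()  # compressed sparse row: fast arithmetic and row slicing
print(csr.nnz)        # number of stored non-zeros -> 2
```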
– miku
  • Exactly what I needed. Any good resources for scipy that you can recommend? – Ankur Chauhan Dec 21 '09 at 09:54
  • One small question. The numbers in the CSV are not indices; they are IDs, i.e. the file starts with 0001001,9304045 0001001,9308122 0001001,9309097 0001001,9311042 0001001,9401139 0001001,9404151 0001001,9407087 0001001,9408099 0001001,9501030 0001001,9503124. The IDs serve only to identify nodes, so they may be replaced by equivalent indices as long as those are unique. How do I accomplish this? I know I can make rows and columns as big as the largest ID, but that seems wasteful, since e.g. the slots for indices 0–1001 would go unused. – Ankur Chauhan Dec 21 '09 at 10:01
  • I understand your concern, and I assume there is no single best way to 'compress' your data to the relevant elements; it depends largely on what you want to do with the data later. E.g. you could use a 'mapping dictionary' which maps the actual IDs to smaller numerical values... – miku Dec 21 '09 at 10:17
  • If you do want to 'squeeze' your indices so that they start at 0 and go up in increments of 1 to some maximum, why not (1) sort them producing `sorted_ixs` (`sorted_ixs = ixs; sorted_ixs.sort()`), (2) `zip(sorted_ixs, range(len(sorted_ixs))` producing a list of pairs matching an index with a 'squeezed index', (3) use the list as a 'translation table' from old to new indices. – Michał Marczyk Dec 21 '09 at 21:36
  • Actually this will also sort `ixs`, I think; use `sorted_ixs = ixs[:]` if you want to keep your unsorted `ixs` around. – Michał Marczyk Dec 21 '09 at 21:37
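A minimal sketch of the 'mapping dictionary' idea from the comments (my own illustration, using the sample IDs quoted above): assign each raw ID a compact 0-based index the first time it is seen.

```python
# map arbitrary node IDs to dense 0-based indices
index_of = {}

def compact(node_id):
    """Return a compact index for an arbitrary node ID, assigning one if new."""
    if node_id not in index_of:
        index_of[node_id] = len(index_of)
    return index_of[node_id]

edges = [("0001001", "9304045"), ("0001001", "9308122"), ("9304045", "9308122")]
pairs = [(compact(a), compact(b)) for a, b in edges]
print(pairs)  # -> [(0, 1), (0, 2), (1, 2)]
```

The resulting pairs can be used directly as (row, column) indices into a sparse matrix whose size is the number of distinct IDs, not the largest ID.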

If you want an adjacency matrix, you can do something like:

import csv
from scipy.sparse import dok_matrix

# dictionary-of-keys sparse matrix: efficient for incremental construction
S = dok_matrix((10000, 10000), dtype=bool)
with open("your_file_name") as f:
    for line in csv.reader(f):
        S[int(line[0]), int(line[1])] = True
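A short usage sketch (mine, not part of the answer above): a dok_matrix can be queried element-wise, and converted to CSR once construction is done:

```python
from scipy.sparse import dok_matrix

S = dok_matrix((100, 100), dtype=bool)
S[1, 32] = True

print(bool(S[1, 32]))  # True: the edge is present
print(S.nnz)           # 1 stored entry
S_csr = S.tocsr()      # convert for fast row operations
```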
– tkerwin

You might also be interested in NetworkX, a pure-Python network/graph analysis package.

From the website:

NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> import networkx as nx
>>> G = nx.Graph()
>>> G.add_edge(1, 2)
>>> G.add_node("spam")
>>> print(G.nodes())
[1, 2, 'spam']
>>> print(G.edges())
[(1, 2)]
– mavnn