0

I have very large data set it is almost 450000 lines and two rows, i want to compute adjacency matrix using python, because previously i have tried to do it in matlab, and it shows memory error because of large data values. my data values also start from 100 and goes upto 450000,

Anyone can help me in this issue, as i am new to python.

I have to first import the file into python using excel sheet or notepad and then compute the adjacency matrix

A A
  • 1
  • 1
  • 1
  • 1
    Are the pairs ("two rows" although you probable meant columns) edge descriptions and you want to produce an adjacency matrix from the list of edges? How many actual vertices are there in the graph? If its 450,000 vertices then you are talking about a matrix with over 200 billion cells! – Judge Maygarden Apr 14 '10 at 13:29
  • @Andreas An adjacency matrix with 450k vertices would occupy closer to 1.5 TB using doubles. It would be more efficient to use a single bit per edge, but that would still take around 24 GB. – Judge Maygarden Apr 14 '10 at 13:30
  • @Judge: Yes, I thought that A A had problems storing a 450000 × 2 matrix of doubles, which would only occupy 7.2 MB. – Andreas Rejbrand Apr 14 '10 at 15:30
  • 1
    Please provide an example of how to process a small section of your data. The format of the adjacency list is not clear from your description. – Justin Peel Apr 14 '10 at 18:25
  • If you don't mind me asking, why do you need a full adjacency matrix in memory as opposed to the sparse representation of the source data? Is it the CAIDA file itself that you are having trouble reading into MatLab? – Judge Maygarden Apr 16 '10 at 19:23
  • defaultdict() would be a way to go - http://stackoverflow.com/questions/13547133/adjacency-list-and-adjacency-matrix-in-python – notablytipsy Aug 04 '13 at 08:16
  • What a nice example of a question who lives a healthy life which seems to be ignored by the OP... – heltonbiker Aug 26 '13 at 13:44

3 Answers3

1

If I understand your question correctly, then you require more memory than is available in RAM. Even with virtual memory, you probably can't allocate a block that big. Therefore, your solution is to write the adjacency matrix to a file as you build it. This method would work in MatLab or Python.


I am assuming you are processing CAIDA's Router-Level Topology Measurements since the format seems to match your description. Each line of this file contains an edge of the graph from one IP router (column 1) to another (column 2). A full adjacency matrix of the 192244 nodes would require 4.3 GB assuming you only use a single bit for each node. I would still suggest writing the matrix directly to a file instead of building it in memory.

Judge Maygarden
  • 26,961
  • 9
  • 82
  • 99
0

I would use defaultdict - it is simple to use, and just a few lines of code. I'm assuming your file looks like

a b
c d

First, put it into a list (http://docs.python.org/2/library/fileinput.html) so that the format is [(a, b),(c,d)].

Then, use defaultdict:

from collections import defaultdict

adjmat = defaultdict(int)
for edge in list:
    adjmat[edge] = 1

adjmat[a, b] will return 1 if the edge exists, 0 otherwise. If you can have multiple edges between nodes, you only need to change that to adjmat[edge] += 1, and adjmat[a, b] will return the number of edges connects a and b

notablytipsy
  • 387
  • 1
  • 4
  • 19
0

The simplest way? Well, if you have over 10,000 nodes, but only 45000 edges, use SciPy's sparse matrix:

http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

SciPy provides various compression methods to keep the actual in-memory size of the matrix down (since the matrix values will largely be 0's). I'm sure MatLab provides a space-conscious sparse matrix data structure as well.

If you want to just know how to read in the file, I'd suggest you save it as a CSV or a text file (there is no real benefit in storing the data in an Excel file). Python comes w/ a library for reading/writing CSV files:

http://docs.python.org/library/csv.html

If you really want to use XLS files, then you can use either pyExcelerator (I've never used this) - http://sourceforge.net/projects/pyexcelerator/ - or you can use OpenOffice.org + PyUNO or MS Office + COM.

tixxit
  • 3,942
  • 2
  • 14
  • 9
  • If my understanding is correct, the file is just pairs of nodes (labeled with integers). You just create a sparse matrix, read the pairs in, and fill out the appropriate cell in the matrix with 1 for each pair. – tixxit Apr 16 '10 at 18:49
  • pyExcelerator is not maintained; use xlrd/xlwt for reading/writing XLS files. – John Machin May 15 '10 at 23:20