
What's a good Python library for manipulating very large matrices (e.g. millions of rows/columns), including the ability to add rows or columns at any stage of the matrix's life?

I looked at PyTables and h5py, but as far as I can tell neither supports adding or removing rows or columns once the matrix is created.

The only other thing I could find was the sparse matrix functionality in numpy/scipy noted in these questions. However, adding or removing rows and columns there seems possible but officially unsupported and a bit hacky, so I fear the performance would be horrible with a real dataset. scipy also includes several different sparse matrix implementations, so I'm confused about which one would be best (e.g. lil_matrix vs csc_matrix vs csr_matrix).

Cerin

1 Answer


If your matrix is sparse, you can add or remove rows or columns with scipy.sparse without resorting to hacks. If you want to remove columns (column slicing), go for csc_matrix; csr_matrix is the one to use for efficient row slicing. It is usually convenient to build the sparse matrix as a coo_matrix, where you specify the row, col and data for each non-zero entry, and then convert it:

from scipy.sparse import coo_matrix

m = coo_matrix((data, (row, col)), shape=(nrow, ncol))
m = m.tocsr()[rows_to_keep, :]  # CSR: efficient row slicing
m = m.tocsc()[:, cols_to_keep]  # CSC: efficient column slicing

where rows_to_keep can be a list or a 1-D array with the indices to keep.
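
The snippet above only removes rows and columns. For adding them, one option is scipy.sparse.vstack and scipy.sparse.hstack, which build a new matrix rather than growing the old one in place. The following is a minimal sketch on a tiny made-up matrix (the values and the names kept, taller and wider are only for illustration), not a benchmark of how this behaves at millions of rows:

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack, vstack

# Illustrative 3 x 4 matrix with four non-zero entries (made-up values).
row = np.array([0, 1, 2, 2])
col = np.array([0, 2, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0])
m = coo_matrix((data, (row, col)), shape=(3, 4))

# Removing: convert to the format that suits the slicing direction.
rows_to_keep = [0, 2]
cols_to_keep = [0, 1, 3]
kept = m.tocsr()[rows_to_keep, :]     # row slicing on CSR
kept = kept.tocsc()[:, cols_to_keep]  # column slicing on CSC

# Adding a row: stack a 1 x ncol sparse row underneath.
new_row = coo_matrix(([5.0], ([0], [1])), shape=(1, kept.shape[1]))
taller = vstack([kept, new_row], format="csr")

# Adding a column: stack an nrow x 1 sparse column on the right.
new_col = coo_matrix(([6.0], ([2], [0])), shape=(taller.shape[0], 1))
wider = hstack([taller, new_col], format="csc")

print(wider.toarray())
```

Each vstack/hstack call allocates a new matrix, so if many rows or columns arrive over time it is probably cheaper to collect them and stack once than to stack one at a time.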

If you need a dense matrix, you could use a numpy.memmap array, which is backed by a file on disk. To create one you can do:

import numpy as np

a = np.memmap('a.memmap', dtype='float64', mode='w+', shape=(1000, 1000))
a.fill(100.)

To open an existing one, pass the same dtype and shape, since the file itself stores neither:

a = np.memmap('a.memmap', dtype='float64', mode='r+', shape=(1000, 1000))

If you want to remove or add rows or columns, you have to create a second memmap array and then copy the rows or columns that you want from the original one into it:

b = np.memmap('b.memmap', dtype='float64', mode='w+', shape=(3, 1000))
b[:] = a[[0, 99, 199], :]  # write into b; plain "b = ..." would only rebind the name

This saves the 1st, 100th and 200th rows of a in b, with all the columns.
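
To make the copy-into-a-second-memmap idea concrete, here is a small end-to-end sketch. It assumes temporary files (the temp directory, the arange fill and the names c and b_check are only for illustration) and shows both dropping rows and growing by one row:

```python
import os
import tempfile

import numpy as np

# Temporary location for the backing files (illustrative only).
tmpdir = tempfile.mkdtemp()
a_path = os.path.join(tmpdir, "a.memmap")
b_path = os.path.join(tmpdir, "b.memmap")
c_path = os.path.join(tmpdir, "c.memmap")

# Source array: row i is filled with the value i, so rows are easy to identify.
a = np.memmap(a_path, dtype="float64", mode="w+", shape=(1000, 1000))
a[:] = np.arange(1000, dtype="float64")[:, None]
a.flush()

# Removing rows: copy the rows to keep into a smaller memmap.
rows = [0, 99, 199]
b = np.memmap(b_path, dtype="float64", mode="w+", shape=(len(rows), 1000))
b[:] = a[rows, :]  # slice assignment writes into b's file
b.flush()

# Adding a row: allocate a larger memmap, copy the old data, fill the new row.
c = np.memmap(c_path, dtype="float64", mode="w+", shape=(b.shape[0] + 1, 1000))
c[: b.shape[0]] = b
c[-1] = 7.0
c.flush()

# Reopen read-only to confirm the data reached the file.
b_check = np.memmap(b_path, dtype="float64", mode="r", shape=(len(rows), 1000))
print(b_check[:, 0], c[:, 0])
```

Every resize copies the data it keeps, so for a very large dense matrix it may be worth over-allocating once and tracking how many rows are actually in use.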

Saullo G. P. Castro
  • Thanks, but I'm getting `TypeError: 'coo_matrix' object does not support indexing`. It seems strange to me that any matrix type couldn't be indexed, since that's the whole purpose of a matrix... I'll assume that is explained in the scipy docs, but http://docs.scipy.org has been offline the last couple days. – Cerin Apr 26 '14 at 02:24
  • @Cerin yes, you have to convert first using `tocsr()` or `tocsc()`, then the indexing should work... – Saullo G. P. Castro Apr 26 '14 at 06:04
  • @Cerin I believe the purpose of the `coo_matrix` is to provide one type of sparse matrix which is easy to populate and fast to convert to the other types (`csr_matrix` or `csc_matrix` for example) – Saullo G. P. Castro Apr 26 '14 at 07:24