
I'm actually looking to speed up part #2 of this code as much as possible, so I thought that it might be useful to try Cython. However, I'm not sure how to work with a sparse matrix in Cython. Can somebody show whether it's possible to wrap it in Cython, or perhaps Julia, to make it faster?

#1) This part builds the u_dict dictionary, filled with the unique strings, and enumerates them.

import scipy.sparse as sp
import numpy as np
from scipy.sparse import csr_matrix

full_dict = set(train1.values.ravel().tolist() + test1.values.ravel().tolist() + train2.values.ravel().tolist() + test2.values.ravel().tolist())
print(len(full_dict))
u_dict = dict()
for i, q in enumerate(full_dict):
    u_dict[q] = i


shape = (len(full_dict), len(full_dict))
H = sp.lil_matrix(shape, dtype=np.int8)


def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
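
For reference, the matrix.npz file would have been written by a matching save helper; a minimal sketch of one (the name save_sparse_csr and its use here are assumptions, not part of the question):

def save_sparse_csr(filename, matrix):
    # Store the three CSR attribute arrays plus the shape in one .npz archive,
    # mirroring what load_sparse_csr above reads back.
    np.savez(filename, data=matrix.data, indices=matrix.indices,
             indptr=matrix.indptr, shape=matrix.shape)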

#2) I need to speed up this part
# train_full is pandas dataframe with two columns w1 and w2 filled with strings

H = load_sparse_csr('matrix.npz')

correlation_train = []
for idx, row in train_full.iterrows():
    if idx % 1000 == 0: print(idx)
    id_1 = u_dict[row['w1']]
    id_2 = u_dict[row['w2']]
    a_vec = H[id_1].toarray() # these vectors are of length of < 3 mil.
    b_vec = H[id_2].toarray()
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0][1])
Artem Korol
  • See [this question](http://stackoverflow.com/questions/25295159/how-to-properly-pass-a-scipy-sparse-csr-matrix-to-a-cython-function). – Ami Tavory May 20 '17 at 21:35

1 Answer


While I contributed to [How to properly pass a scipy.sparse CSR matrix to a cython function?](http://stackoverflow.com/questions/25295159/how-to-properly-pass-a-scipy-sparse-csr-matrix-to-a-cython-function) quite some time ago, I doubt that cython is the way to go, especially if you don't already have experience with numpy and cython. cython gives the biggest speedup when you replace iterative calculations with code that it can translate to C without calling numpy or other python code. Throw pandas into the mix and you have an even bigger learning curve.

And important parts of the scipy sparse code are already written in cython.

Without touching the cython issue, I see a couple of problems.

H is defined twice:

H = sp.lil_matrix(shape, dtype=np.int8)
H = load_sparse_csr('matrix.npz')

That's either an oversight, or a misunderstanding of how Python variables are created and assigned. The second assignment replaces the first, so the first does nothing. In addition, the first just makes an empty lil matrix. Such a matrix could be filled iteratively; while not fast, that is the intended use of the lil format.
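
For illustration, the intended iterative use looks something like this (a sketch; the pair-counting loop and the some_pairs iterable are assumptions, not code from the question):

H = sp.lil_matrix(shape, dtype=np.int8)
for w1, w2 in some_pairs:           # hypothetical iterable of string pairs
    i, j = u_dict[w1], u_dict[w2]
    H[i, j] += 1                    # per-element assignment is what lil is for
H = H.tocsr()                       # convert once, then do fast row access/math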

The second expression creates a new matrix from data saved in an npz file. That involves the numpy npz file loader as well as the basic csr matrix creation code. And since the attributes are already in csr format, there's nothing for cython to touch.

You do have an iteration here, but it is over a Pandas dataframe:

for idx, row in train_full.iterrows():
    id_1 = u_dict[row['w1']]
    a_vec = H[id_1].toarray()

It looks like you are picking a particular row of H based on a dictionary lookup. Sparse matrix indexing is slow compared to dense array indexing. So if Ha = H.toarray() fits in your memory, then

a_vec = Ha[id_1,:]

will be a lot faster.
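
Concretely, the #2 loop could be rewritten along these lines (a sketch, assuming the dense form of H fits in RAM):

Ha = H.toarray()                       # one-time densification
correlation_train = []
for idx, row in train_full.iterrows():
    a_vec = Ha[u_dict[row['w1']], :]   # plain numpy row indexing: fast
    b_vec = Ha[u_dict[row['w2']], :]
    correlation_train.append(np.corrcoef(a_vec, b_vec)[0, 1])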

Faster selection of rows (or columns) from a sparse matrix has been asked about before. If you could work directly with the sparse data of a row, I could recommend something more direct. But you want a dense array that you can pass to np.corrcoef, so we'd have to implement the toarray step as well.

How to read/traverse/slice Scipy sparse matrices (LIL, CSR, COO, DOK) faster?
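
If the full toarray doesn't fit in memory, one alternative along the lines of that question is to slice the raw csr attribute arrays yourself and densify just one row at a time (a rough sketch, not a drop-in replacement):

def csr_row_to_dense(H, i):
    # Slice the raw CSR arrays for row i and scatter the values into a
    # dense vector, bypassing the sparse __getitem__ machinery.
    start, stop = H.indptr[i], H.indptr[i + 1]
    out = np.zeros(H.shape[1], dtype=H.dtype)
    out[H.indices[start:stop]] = H.data[start:stop]
    return out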

hpaulj