Suppose I have a sparse matrix representing a document collection, where each row is a vector representing one document (generated by scikit-learn's tfidf_transformer, for example).

tfidf_matrix = tfidf_transformer.fit_transform(posting)

Now a query comes in:

query = transformer.transform(vectorizer.transform(['I am a sample query']))

I want to compare this query to each document (each row) of the matrix using scipy.spatial.distance.cosine (which returns the cosine distance, i.e. 1 minus the cosine similarity). So I do a map as follows:

result = map(lambda document: cosine(document.toarray(), query[0].toarray()), tfidf_matrix)

It could be done with a loop as well:

result = []
for row in tfidf_matrix:
    result = result + [cosine(row.toarray(), query[0].toarray())]

However, it is slow (out of frustration I also tried gevent.threadpool.map, with the same result). I am pretty sure that mapping a function over each row of a sparse matrix like this is not the right approach, but I can't seem to find the proper way of doing it.

So the question is: what is the proper way to map a function to each row of a sparse matrix (scipy.sparse.csr_matrix)?

Jeffrey04
  • just a note, for this particular case (cosine similarity), this is the best way to do it http://stackoverflow.com/a/18914884/5742 – Jeffrey04 Oct 12 '15 at 10:13

1 Answer

The first thing I noticed was that you're running query[0].toarray() every time you go through the for loop (or on every iteration of the map() call). Is that value ever going to change between rows? If it isn't, you can save some time by calculating it just once, outside the for loop:

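result = []
# Convert the query once. ravel() flattens the 1 x n result to 1-D,
# which newer SciPy versions of cosine() require.
query_array = query[0].toarray().ravel()
for row in tfidf_matrix:
    result = result + [cosine(row.toarray().ravel(), query_array)]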

Also, don't do result = result + [another_list_element]; it copies the whole list into a new one on every iteration, which is much slower than result.append(another_list_element). In this case, you should be doing:

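result = []
# Convert the query once. ravel() flattens the 1 x n result to 1-D,
# which newer SciPy versions of cosine() require.
query_array = query[0].toarray().ravel()
for row in tfidf_matrix:
    result.append(cosine(row.toarray().ravel(), query_array))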

Or with map, that would be:

query_array = query[0].toarray().ravel()
# list() is needed on Python 3, where map() returns a lazy iterator
result = list(map(lambda document: cosine(document.toarray().ravel(), query_array), tfidf_matrix))
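To make the difference between the ways of collecting results concrete, here is a minimal, self-contained sketch. The function names and the stand-in computation (doubling a number instead of calling cosine()) are made up for illustration; only the list-building pattern matters.

```python
def collect_with_concat(values):
    result = []
    for value in values:
        # Builds a brand-new list each time, copying everything collected so far.
        result = result + [value * 2]
    return result


def collect_with_append(values):
    result = []
    for value in values:
        # Grows the existing list in place.
        result.append(value * 2)
    return result


def collect_with_comprehension(values):
    # Same result as the append loop, written as a single expression.
    return [value * 2 for value in values]


doubled = collect_with_append(range(5))
```

All three return the same list; the concatenation version just does more copying as the list grows, so the gap widens with the number of rows.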

There may be other speedups possible as well, but try this one and see if it helps.

EDIT: Also, have you seen Function application over numpy's matrix row/column? It looks like the vectorize function may be what you want. I can't give you more details since I'm not really familiar with numpy and scipy myself, but that looks like a good starting point for your reading.
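One caveat on vectorize: the NumPy documentation describes it as a convenience wrapper that is essentially a for loop, so it is unlikely to be faster here. For this particular function, a possible alternative is to skip the per-row loop entirely and compute every distance with one sparse matrix product. The sketch below is only an illustration under some assumptions: the helper name cosine_distances_to_query and the toy docs and query data are invented for the example, and rows that are entirely zero come out as nan.

```python
import numpy as np
from scipy import sparse


def cosine_distances_to_query(matrix, query):
    """Cosine distance between every row of `matrix` and the single-row `query`."""
    matrix = sparse.csr_matrix(matrix)
    query = sparse.csr_matrix(query)

    # Dot product of each row with the query: (n x d) times (d x 1) gives (n x 1).
    dots = matrix.dot(query.T).toarray().ravel()

    # Euclidean norm of each row and of the query, computed without densifying.
    row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
    query_norm = np.sqrt(query.multiply(query).sum())

    # All-zero rows divide by zero and end up as nan in this sketch.
    with np.errstate(divide="ignore", invalid="ignore"):
        similarity = dots / (row_norms * query_norm)

    # scipy.spatial.distance.cosine returns a distance, so mirror that here.
    return 1.0 - similarity


# Toy stand-ins for tfidf_matrix and the transformed query.
docs = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                   [0.0, 3.0, 0.0],
                                   [1.0, 1.0, 1.0]]))
query = sparse.csr_matrix(np.array([[1.0, 0.0, 1.0]]))

distances = cosine_distances_to_query(docs, query)
```

With the variables from the question, the call would be cosine_distances_to_query(tfidf_matrix, query). If scikit-learn is already in use, cosine_similarity in sklearn.metrics.pairwise also accepts sparse input and may be simpler than a hand-rolled helper.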

rmunn
  • the reason I needed toarray() is because cosine(u, v) would throw "ValueError: dimension mismatch" (though both should be in the same dimension since both are result returned by the tfidf_transformer). – Jeffrey04 Oct 12 '15 at 09:35
  • But do you need to run `toarray()` every time through the loop, or will calculating it once outside the loop be enough? – rmunn Oct 12 '15 at 09:35
  • Yeah, I could do query[0].toarray() outside the loop (: Thanks for the reminder. I am just curious whether there's an easier (and possibly faster) way to do this – Jeffrey04 Oct 12 '15 at 09:37