I have the following situation that I want to address using Python (preferably using numpy and scipy):
- Convert a collection of documents to a sparse term-document matrix.
- Extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a certain subset of documents (documents are labelled with categories, and I want to find similar documents within the same category).
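To make the second step concrete, here is a minimal sketch of a category-restricted top-k cosine similarity lookup, assuming the term-document matrix already exists. The function name `top_k_similar`, its parameters, and the toy data are hypothetical and only illustrate the idea; rows are L2-normalised first so that a plain dot product equals the cosine similarity.

```python
import numpy as np
from scipy import sparse


def top_k_similar(matrix, labels, doc_index, k=10):
    """Return (row index, cosine similarity) pairs for the k documents most
    similar to doc_index, restricted to documents with the same label.
    Hypothetical helper, not part of numpy or scipy."""
    csr = sparse.csr_matrix(matrix, dtype=float)

    # L2-normalise every row so that a dot product is a cosine similarity.
    norms = np.sqrt(np.asarray(csr.multiply(csr).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave empty documents as all-zero rows
    normalised = (sparse.diags(1.0 / norms) @ csr).tocsr()

    # Candidate rows: same category as the query, excluding the query itself.
    labels = np.asarray(labels)
    candidates = np.flatnonzero(labels == labels[doc_index])
    candidates = candidates[candidates != doc_index]

    # One sparse matrix-vector product gives all similarities at once.
    sims = (normalised[candidates] @ normalised[doc_index].T).toarray().ravel()
    order = np.argsort(-sims)[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


# Toy data (assumed): 5 documents over 4 terms, in two categories.
toy_matrix = sparse.coo_matrix(np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]))
toy_labels = ["a", "a", "a", "b", "b"]
similar = top_k_similar(toy_matrix, toy_labels, doc_index=0, k=10)
```

Converting to CSR before slicing is a deliberate choice in this sketch: `coo_matrix` does not support row indexing, whereas CSR does so efficiently.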
How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find cosine similarity, but how do I convert the entire corpus to a large but sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?
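For illustration, one possible way to build such a matrix with only the standard library and scipy is sketched below. It assumes lowercased whitespace tokenisation and raw term counts as weights; the function name `build_term_document_matrix` and the sample documents are hypothetical, and a real corpus would likely need a better tokeniser.

```python
from collections import Counter

from scipy import sparse


def build_term_document_matrix(documents):
    """Build a documents-by-terms count matrix from an iterable of strings.
    Hypothetical helper; returns the matrix in CSR form plus the vocabulary
    (a dict mapping each term to its column index)."""
    vocabulary = {}
    rows, cols, counts = [], [], []
    n_docs = 0
    for row, text in enumerate(documents):
        n_docs += 1
        # Assumption: lowercased, whitespace-separated tokens.
        for term, count in Counter(text.lower().split()).items():
            # Assign the next free column to terms seen for the first time.
            col = vocabulary.setdefault(term, len(vocabulary))
            rows.append(row)
            cols.append(col)
            counts.append(count)

    # COO is convenient for construction from (row, col, value) triplets.
    matrix = sparse.coo_matrix(
        (counts, (rows, cols)),
        shape=(n_docs, len(vocabulary)),
        dtype=float,
    )
    # CSR supports fast row slicing, which COO does not.
    return matrix.tocsr(), vocabulary


docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]
matrix, vocabulary = build_term_document_matrix(docs)

# Extract one document as a 1 x n_terms coo_matrix row vector.
row_vector = matrix.getrow(1).tocoo()
```

Only the non-zero counts are ever stored, so memory grows with the number of distinct (document, term) pairs and not with the full size of the matrix.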
Thanks.