I'm performing an NMF decomposition on a tf-idf input in order to perform topic analysis.

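from sklearn import decomposition

def decomp(tfidfm, topic_count):
    # Factorise the tf-idf matrix into document-topic and topic-term factors.
    model = decomposition.NMF(init="nndsvd", n_components=topic_count, max_iter=500)
    H = model.fit_transform(tfidfm)  # documents x topics
    W = model.components_            # topics x terms
    return W, H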

This returns W, a model definition consisting of topic-to-term assignments, and H, a document-to-topic assignment matrix.

So far so good: I can use H to classify documents by their association, via term frequency, with a list of topics, which are themselves defined by their association with term frequency.

I'd like to save the topic-term associations to disk so I can reapply them later, and have adopted the method described here [https://stackoverflow.com/questions/8955448] to store the sparse-matrix representation of W.
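For illustration, here is a minimal sketch of round-tripping a sparse topic-term matrix through disk. It uses `scipy.sparse.save_npz` and `load_npz`, which is one option in recent SciPy versions and not necessarily the method from the linked answer. The toy matrix and the file name `topic_terms.npz` are made up for the example.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical topic-term matrix W (2 topics x 3 terms), stored sparsely.
W = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                [2.0, 0.0, 0.3]]))

# Write to a temporary location; the file name is an assumption.
path = os.path.join(tempfile.mkdtemp(), "topic_terms.npz")
sparse.save_npz(path, W)

# Later (possibly in another process): read it back.
W_loaded = sparse.load_npz(path)
```

Note that `model.components_` is normally a dense array, so it would need converting (for example with `sparse.csr_matrix(...)`) before being saved this way.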

What I'd like to do now is perform the same process, only with the topic-definition matrix W held fixed.

In the documentation, it appears that I can set W in the calling parameters something along the lines of:

def applyModel(tfidfm,W,topic_count):
    model = decomposition.NMF(init="nndsvd", n_components=topic_count, max_iter=500)
    H = model.fit_transform(X=tfidfm, W=W)
    W = model.components_
    return W, H

And I've tried this, but it doesn't appear to work.

I tested this by building a W matrix from a differently sized vocabulary and feeding it into the applyModel function. I expected the shapes of the resulting matrices to be determined by the W model, but this isn't the case.

The short version of this question is: How can I save the topic-model generated from a matrix decomposition, such that I can use it to classify a different document set than the one used to originally generate it?

In other terms, if V=WH, then how can I return H, given V and W?

Thomas Kimber

2 Answers

The initial equation is $V = WH$, and we solve it for $H$ like this: $H = W^{-1}V$.

Here $W^{-1}$ denotes the inverse of the matrix $W$, which exists only if $W$ is nonsingular.

The multiplication order is, as always, important. If you had $V = HW$, you'd need to multiply $V$ by the inverse of $W$ the other way round: $H = VW^{-1}$.
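As a rough numerical check of the point about multiplication order, the sketch below uses the layout from the question's code, where V (documents x terms) equals H (documents x topics) times W (topics x terms). Since a topic-term matrix is generally not square, it swaps in the Moore-Penrose pseudo-inverse (`numpy.linalg.pinv`) for a true inverse. The toy matrices are invented for the example.

```python
import numpy as np

# Hypothetical factors: 3 documents, 2 topics, 4 terms.
H_true = np.array([[1.0, 0.0],
                   [0.5, 2.0],
                   [0.0, 3.0]])
W = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0, 1.0]])

# V = H W, so H sits on the left and W must be "undone" from the right.
V = H_true @ W

# W is 2 x 4 (not square), so use the pseudo-inverse rather than inv().
H_est = V @ np.linalg.pinv(W)
```

Here W has full row rank, so H is recovered exactly up to floating-point error. With real data that W does not reproduce exactly, the pseudo-inverse gives a least-squares fit, and its entries may be negative.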

ForceBru
  • Of course! Maths wins again. I'll post the solution I used to perform the matrix multiplication/inverse to get H. It looks as though I'm getting meaningful results for the application I'm applying it to. I will mark as answered shortly but would like to leave open to invite any additional answers - I was anticipating something baked into scikit and don't want to replicate a process if there's already something there. – Thomas Kimber Oct 17 '16 at 19:46

For completeness, here's the rewritten applyModel function that takes into account the answer from ForceBru (it uses an import of scipy.sparse.linalg):

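from scipy.sparse import linalg

def applyModel(tfidfm, W):
    # H = V * W^-1; linalg.inv expects a square, nonsingular sparse W.
    H = tfidfm * linalg.inv(W)
    return H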

This returns (assuming an aligned vocabulary) a mapping of documents to topics H based on a pregenerated topic-model W and document feature matrix V generated by tfidf.
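On the comment's hope for something built into scikit-learn: a fitted `NMF` object has a `transform` method that keeps `components_` fixed and solves only for the document-topic matrix, with non-negative output. The sketch below is one possible way to use it, assuming the fitted model itself is persisted (here with `pickle`) rather than only the matrix. The random toy data stands in for tf-idf matrices, and the variable names follow the question (scikit-learn's own documentation uses W and H the other way round).

```python
import pickle

import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-ins for tf-idf matrices sharing a 5-term vocabulary.
rng = np.random.default_rng(0)
V_train = rng.random((6, 5))  # 6 training documents
V_new = rng.random((3, 5))    # 3 unseen documents

# Fit once; components_ is the topic-term matrix (W in the question).
model = NMF(init="nndsvd", n_components=2, max_iter=500)
H_train = model.fit_transform(V_train)
W = model.components_

# Persist the fitted model and restore it later.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# transform() holds components_ fixed and solves only for the new H.
H_new = restored.transform(V_new)
```

As with the matrix-inverse approach, this assumes an aligned vocabulary: the new documents would need to be vectorised with the same fitted tf-idf vectoriser as the training set.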

Thomas Kimber