
I'm working with Spark's MLlib, and I'm currently doing some work with LDA.

But when I use the code provided by Spark (see below) to predict on a document that was used to train the model, the predicted document-topics result is completely at odds with the document-topics result from training.

I don't know what causes this.

I'm asking for help; here is my code:

Training: `lda.run(corpus)`, where `corpus` is an `RDD[(Long, Vector)]` and each `Vector` is a word-count vector indexed by vocabulary position (see the end-to-end sketch after the predict code below). Prediction:

    // Infer topic distributions for the given documents.
    def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel): Array[(Long, Vector)] = {
      var docTopicsWeight = new Array[(Long, Vector)](documents.collect().length)
      ldaModel match {
        case localModel: LocalLDAModel =>
          // A local model can score documents directly.
          docTopicsWeight = localModel.topicDistributions(documents).collect()
        case distModel: DistributedLDAModel =>
          // A distributed (EM) model must be converted to a local one first.
          docTopicsWeight = distModel.toLocal.topicDistributions(documents).collect()
      }
      docTopicsWeight
    }
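
For reference, here is a minimal end-to-end sketch of the setup described above (a sketch only: it assumes Spark 1.5+ MLlib, an existing `SparkContext` named `sc`, and toy word-count vectors that are purely illustrative):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // Toy corpus: (docId, term-count vector over a 4-word vocabulary).
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(3.0, 0.0, 1.0, 0.0)),
      (1L, Vectors.dense(0.0, 2.0, 0.0, 4.0)),
      (2L, Vectors.dense(1.0, 1.0, 2.0, 0.0))
    ))

    // Train; the default EM optimizer returns a DistributedLDAModel.
    val ldaModel = new LDA().setK(2).setMaxIterations(50).run(corpus)

    // Re-infer topic mixtures for the training documents.
    predict(corpus, ldaModel).foreach { case (id, topics) =>
      println(s"doc $id -> $topics")
    }
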
1 Answer


I'm not sure whether your question is actually about why you were getting errors in your code, but from what I understand, the first issue is that you were using the default `Vector` class rather than `org.apache.spark.mllib.linalg.Vector`. Secondly, you can't use a case match on the model directly; you'll need to use the `isInstanceOf` and `asInstanceOf` methods for that.

    def predict(documents: RDD[(Long, org.apache.spark.mllib.linalg.Vector)], ldaModel: LDAModel): Array[(Long, org.apache.spark.mllib.linalg.Vector)] = {
      var docTopicsWeight = new Array[(Long, org.apache.spark.mllib.linalg.Vector)](documents.collect().length)
      if (ldaModel.isInstanceOf[LocalLDAModel]) {
        docTopicsWeight = ldaModel.asInstanceOf[LocalLDAModel].topicDistributions(documents).collect()
      } else if (ldaModel.isInstanceOf[DistributedLDAModel]) {
        docTopicsWeight = ldaModel.asInstanceOf[DistributedLDAModel].toLocal.topicDistributions(documents).collect()
      }
      docTopicsWeight
    }
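
On the discrepancy itself, here is a small sketch of one way to compare the two sets of topic weights side by side (assuming the EM optimizer, so the model is a `DistributedLDAModel`; `corpus` and `ldaModel` are the hypothetical values from the sketch in the question):

    import org.apache.spark.mllib.clustering.DistributedLDAModel

    // Training-time document-topic distributions kept by the EM model.
    val trained = ldaModel.asInstanceOf[DistributedLDAModel].topicDistributions.collect().toMap

    // Re-inferred distributions from the converted local model.
    val inferred = predict(corpus, ldaModel).toMap

    trained.keys.foreach { id =>
      println(s"doc $id: trained=${trained(id)} inferred=${inferred(id)}")
    }
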
  • Your code just modifies some formatting, and it doesn't seem to solve my question. Could you please add some explanation? – Carlos Nov 04 '15 at 09:50
  • It's not just about formatting. Plus, your question actually isn't clear. – eliasah Nov 04 '15 at 09:52
  • Thanks for your help; I'll try your code. My question isn't very clear because this is the first time I've asked a question on Stack Overflow. When I use a document from training as the input for prediction, the predicted topic weights come out almost uniform, even though training gave that document a sharply peaked doc-topics result (for example, the 5th topic might have a weight of 0.8 in training but only about 0.1 in prediction). – Carlos Nov 04 '15 at 10:05
  • I'm sorry Carlos, I still don't understand the problem. Try to break it down into input, output, and expected output. – eliasah Nov 04 '15 at 12:38
  • Well, could you please tell me how you do prediction with an LDA model on Spark? – Carlos Nov 05 '15 at 05:42
  • @Carlos, I think the way you're doing prediction is OK (except you don't really need the `var`, reassignment, etc. in there...). I suspect the discrepancies you're seeing have something to do with issues in the variational inference of the `LocalLDAModel`. What happens if you call `predict` several times? Do you get (significantly) different topic weights for the documents each time? Do any of these look more reasonable? – Jason Scott Lenderman Nov 05 '15 at 07:32
  • Jason seems to have understood your issue so far, but the problem isn't in the `var` declaration. – eliasah Nov 05 '15 at 07:59
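
A quick sketch of the check Jason suggests, reusing the hypothetical `corpus`, `ldaModel`, and `predict` from above (illustrative only; whether the weights actually vary depends on the inference being non-deterministic):

    // Run prediction several times on the same documents and print the
    // topic weights for one document, to see how much they vary run to run.
    (1 to 5).foreach { run =>
      val weights = predict(corpus, ldaModel).toMap
      println(s"run $run: doc 0 -> ${weights(0L)}")
    }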