
I'm using Spark's LDA implementation, as shown in the example code here. I want to get consistent topics/topic distributions for my training data: I'm training on two machines and would like the output to be identical on both.

I understand that LDA uses a random component for training/inference, as explained in this SO post. It looks like consistent results can be achieved in Python's gensim by setting the seed value manually. I've tried the same in Spark, but I still get slight variance in the output topic distributions.

val ldaParams: LDA = new LDA().setK(10).setMaxIterations(60).setSeed(10L)
val distributedLDAModel: DistributedLDAModel = ldaParams.run(corpusInfo.docTermVectors).asInstanceOf[DistributedLDAModel]
val topicDistributions: Map[Long, Vector] = distributedLDAModel.topicDistributions.collect.toMap // produces different results on each run

Is there a way I can get consistent topic distributions for my training set of data?
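Not an answer, but one way to narrow down the variance: fix every knob you control and remove partition-order effects. The sketch below is an assumption-laden variant of the question's snippet, not a guaranteed fix; `setOptimizer("em")` and `coalesce(1)` are standard MLlib/RDD calls, but collapsing to one partition only makes sense for a corpus small enough to fit on one executor.

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

// Sketch: pin down as many sources of randomness as possible.
// Assumes corpusInfo.docTermVectors is the RDD[(Long, Vector)] from the question.
val ldaParams: LDA = new LDA()
  .setK(10)
  .setMaxIterations(60)
  .setSeed(10L)          // fixes the RNG sequence
  .setOptimizer("em")    // EM is the default optimizer; stated explicitly here

// Coalescing to a single partition removes shuffle-order variation in the
// reductions, at the cost of all parallelism -- only viable for small corpora.
val singlePartitionCorpus = corpusInfo.docTermVectors.coalesce(1)

val model: DistributedLDAModel =
  ldaParams.run(singlePartitionCorpus).asInstanceOf[DistributedLDAModel]
```

Even then, identical results across two different machines are not guaranteed if the task scheduling or JVM/BLAS versions differ.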

alex9311
    What do you mean by _slight_? Can you give an estimated order of magnitude? – zero323 Jun 18 '16 at 11:23
  • I could try to get an estimate, but why does that matter? I'm looking for consistent results. Is that even possible? – alex9311 Jun 19 '16 at 08:33
  • Well, it matters because we work with this ugly thing called FP arithmetic, where operations we expect to be associative and commutative are not. Since every shuffle in Spark introduces a source of non-determinism, there is no strong guarantee you'll get exactly the same result with each run even if the RNG sequence is deterministic. – zero323 Jun 19 '16 at 12:34
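The non-associativity zero323 mentions is easy to demonstrate: summing the same three doubles in a different order gives different results, which is exactly what happens when a shuffle changes the order in which partial sums are combined.

```scala
// Floating-point addition is not associative: the grouping (and therefore
// the partition/shuffle order in Spark) can change the result slightly.
val leftFirst  = (0.1 + 0.2) + 0.3   // 0.6000000000000001
val rightFirst = 0.1 + (0.2 + 0.3)   // 0.6
assert(leftFirst != rightFirst)
```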

0 Answers