I am trying to create an LDA model using Apache Spark ML in Java. The input documents are plain strings. The topics I get back are described by numeric term indices rather than by words.

I found a similar question, "LDA with topicmodels, how can I see which topics different documents belong to?", but its solution is in R. I am looking for a solution that uses Spark's ML library in Java.

Any help would be much appreciated. Thanks!

minie

1 Answer

If you use the CountVectorizer transformer in your pipeline, you can retrieve the indexed vocabulary from the fitted model like this:

String[] vocabulary = countVectorizerModel.vocabulary();

Then run LDA over the sparse vectors produced by that text -> (term counts) transformation.

When inspecting the LDA results, use each term index to look up the corresponding word in the vocabulary:

import scala.Tuple2;

// Each element pairs the term indices of one topic with their scores
Tuple2<int[], double[]>[] topicsDescribed = ldaModel.describeTopics();

int idxTopic = 0;
for (Tuple2<int[], double[]> element : topicsDescribed) {
    idxTopic++;
    int[] termIndices = element._1();
    double[] termScores = element._2();

    System.out.println("Topic >> " + idxTopic);
    for (int i = 0; i < termIndices.length; i++) {
        // The term index is a position in the CountVectorizer vocabulary
        System.out.println("termIndex --> " + termIndices[i]
                + ", word=" + vocabulary[termIndices[i]]
                + ", score=" + termScores[i]);
    }
}

This works because the vocabulary of terms stays consistent throughout the pipeline, so the following holds:

ldaModel.vocabSize() == vocabulary.length
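
To make the lookup itself concrete, here is a minimal sketch in plain Java that does not depend on Spark. The class name TopicWords, the method describeTopic, and the hard-coded vocabulary, indices, and weights are all hypothetical stand-ins for what countVectorizerModel.vocabulary() and describeTopics() would give you. It also adds a bounds check, which fails loudly if the vocabulary and the model ever get out of sync:

```java
import java.util.ArrayList;
import java.util.List;

public class TopicWords {

    // Translates the term indices of one topic into "word=weight" strings.
    // Hypothetical helper, not part of the Spark API.
    public static List<String> describeTopic(String[] vocabulary,
                                             int[] termIndices,
                                             double[] termWeights) {
        if (termIndices.length != termWeights.length) {
            throw new IllegalArgumentException("indices and weights differ in length");
        }
        List<String> described = new ArrayList<>();
        for (int i = 0; i < termIndices.length; i++) {
            int idx = termIndices[i];
            // An out-of-range index means the vocabulary does not match the model
            if (idx < 0 || idx >= vocabulary.length) {
                throw new IllegalArgumentException(
                        "term index " + idx + " is outside the vocabulary");
            }
            described.add(vocabulary[idx] + "=" + termWeights[i]);
        }
        return described;
    }

    public static void main(String[] args) {
        // Made-up data with the same shape as the real vocabulary and topics
        String[] vocabulary = {"spark", "topic", "java", "model"};
        int[][] termIndices = {{0, 2}, {3, 1}};
        double[][] termWeights = {{0.6, 0.4}, {0.7, 0.3}};

        for (int t = 0; t < termIndices.length; t++) {
            System.out.println("Topic " + t + ": "
                    + describeTopic(vocabulary, termIndices[t], termWeights[t]));
        }
    }
}
```

The same lookup should apply if you use the DataFrame-based LDA model, where describeTopics() returns a DataFrame with termIndices and termWeights columns instead of an array of tuples; only the way you read the indices out changes.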
marilena.oita