If you use a CountVectorizer transformer in your pipeline, you can retrieve the indexed vocabulary like this:
String[] vocabulary = countVectorizerModel.vocabulary();
Then you run LDA over the SparseVectors produced by that text -> (term counts) transformation.
When looking at the LDA results, each term index can be looked up in that vocabulary:
Tuple2<int[], double[]>[] topicsDescribed = ldaModel.describeTopics();
int idxTopic = 0;
for (Tuple2<int[], double[]> element : topicsDescribed) {
    idxTopic++;
    int[] termIndices = element._1;
    double[] termScores = element._2;
    System.out.println("Topic >> " + idxTopic);
    for (int i = 0; i < termIndices.length; i++) {
        // termIndices[i] is a position in the CountVectorizer vocabulary
        System.out.println("termIndex --> " + termIndices[i] + ", word=" + vocabulary[termIndices[i]] + ", score=" + termScores[i]);
    }
}
This works because the vocabulary of terms stays consistent throughout the pipeline, so the following holds:
ldaModel.vocabSize() == vocabulary.length
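To see the lookup in isolation, here is a minimal sketch in plain Java with no Spark dependency. The class name TopicTermLookup, the method describeTopic, and all the array values are made up for illustration; in a real pipeline the vocabulary would come from countVectorizerModel.vocabulary() and the indices and weights from ldaModel.describeTopics().

```java
import java.util.ArrayList;
import java.util.List;

public class TopicTermLookup {

    // Maps each term index of one topic back to its word and pairs it with its weight.
    // (Hypothetical helper, not part of the Spark API.)
    static List<String> describeTopic(String[] vocabulary, int[] termIndices, double[] termWeights) {
        if (termIndices.length != termWeights.length) {
            throw new IllegalArgumentException("indices and weights must have the same length");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < termIndices.length; i++) {
            // The term index is simply a position in the vocabulary array.
            String word = vocabulary[termIndices[i]];
            lines.add(word + "=" + termWeights[i]);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed vocabulary, standing in for countVectorizerModel.vocabulary().
        String[] vocabulary = {"spark", "topic", "model", "word"};
        // Assumed description of a single topic: term indices and their weights.
        int[] termIndices = {2, 0};
        double[] termWeights = {0.5, 0.25};

        for (String line : describeTopic(vocabulary, termIndices, termWeights)) {
            System.out.println(line);
        }
    }
}
```

The lookup is only valid if the vocabulary array is the one that produced the vectors LDA was trained on, which is why the size check above is a useful sanity test.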