I was wondering how the results of different packages, and hence different algorithms, differ, and whether parameters can be set in a way that produces similar topics. I had a look at the packages text2vec and topicmodels in particular.
I used the code below to compare 10 topics generated with each package (see the code section for the terms). I could not manage to generate sets of topics with similar meaning. For example, topic 10 from text2vec has something to do with "police", but none of the topics produced by topicmodels refers to "police" or similar terms. Further, I could not identify a counterpart in the text2vec topics to topic 5 produced by topicmodels, which has something to do with "life-love-family-war".
I am a beginner with LDA, so my understanding may sound naive to experienced programmers. However, intuitively, one would assume that it should be possible to produce sets of topics with similar meaning in order to demonstrate the validity/robustness of the results: not necessarily the exact same set of terms, but term lists addressing similar topics.
Maybe the issue is simply that my human interpretation of these term lists is not good enough to capture the similarities, but maybe there are parameters that would increase the similarity for human interpretation. Can someone guide me on how to set the parameters to achieve this, or otherwise provide explanations or hints on suitable resources to improve my understanding of the matter?
Here are some issues that might be relevant:
- I know that text2vec does not use standard Gibbs sampling but WarpLDA, which is already a difference in the algorithm compared to topicmodels. If my understanding is correct, the priors alpha and delta used in topicmodels are set as doc_topic_prior and topic_word_prior in text2vec, respectively.
- Furthermore, in postprocessing, text2vec allows adapting lambda for sorting the terms of a topic based on their frequency. I have not yet understood how terms are sorted in topicmodels; is it comparable to setting lambda = 1? (I have tried different lambdas between 0 and 1 without getting similar topics.) See the sketch after this list for how lambda enters the ranking.
- Another issue is that it seems difficult to produce a fully reproducible example even when setting seed (see, e.g., this question). This is not directly my question, but it might make it more difficult to respond.
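If I understand correctly, the lambda used by get_top_words() in text2vec corresponds to the "relevance" measure from Sievert & Shirley (LDAvis), relevance = lambda * log(p(term | topic)) + (1 - lambda) * log(p(term | topic) / p(term)), so lambda = 1 ranks terms purely by their within-topic probability, which seems comparable to how terms() in topicmodels orders them. A minimal sketch of such a ranking (my addition, not package code; phi_topic and term_prob are hypothetical names for one topic's term probabilities and the corpus-wide term probabilities):
rank_terms_by_relevance <- function(phi_topic, term_prob, lambda = 1, n = 15) {
  #phi_topic and term_prob are assumed to be named numeric vectors (names = terms)
  #relevance as in LDAvis: lambda weighs within-topic probability against lift
  relevance <- lambda * log(phi_topic) + (1 - lambda) * log(phi_topic / term_prob)
  names(sort(relevance, decreasing = TRUE))[1:n]
}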
Sorry for the lengthy question, and thanks in advance for any help or suggestions.
Update 2: I have moved the content of my first update into an answer that is based on a more complete analysis.
Update: Following the helpful comment of text2vec package creator Dmitriy Selivanov, I can confirm that setting lambda = 1 increases the similarity of the topics between the term lists produced by the two packages.
Furthermore, I had a closer look at the differences between the term lists produced by the two packages via a quick check of length(setdiff()) and length(intersect()) across topics (see the code below). This rough check shows that text2vec discards several terms per topic, probably via a probability threshold for each topic, whereas topicmodels keeps all terms for all topics. This explains part of the differences in the meanings that a human can derive from the term lists.
As mentioned above, generating a reproducible example seems difficult, so I have not adapted all data examples in the code below. Since the run time is short, anybody can check it on their own system.
library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
ntopics <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter <- 1000
convtol <- 0.001
set.seed(0) #for text2vec
seedpar <- 0 #for topicmodels
#Generate document term matrix with text2vec
tokens = movie_review$review[1:1000] %>%
tolower %>%
word_tokenizer
it = itoken(tokens, ids = movie_review$id[1:1000], progressbar = FALSE)
vocab = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
#LDA model with text2vec
lda_model = text2vec::LDA$new(n_topics = ntopics
,doc_topic_prior = alphaprior
,topic_word_prior = deltaprior
)
doc_topic_distr = lda_model$fit_transform(x = dtm
,n_iter = niter
,convergence_tol = convtol
,n_check_convergence = 25
,progressbar = FALSE
)
#LDA model with topicmodels
#note: the control list must be passed via the 'control' argument of topicmodels::LDA
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm), k = ntopics, method = "Gibbs",
                      control = list(burnin = 100
,delta = deltaprior
,alpha = alphaprior
,iter = niter
,keep = 50
,tol = convtol
,seed = seedpar
,initialize = "seeded"
)
)
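#not in the original post: both packages also expose the fitted topic-word probabilities,
#which can be compared directly in addition to the ranked term lists (uses the models fitted above)
phi_text2vec <- lda_model$topic_word_distribution   #topics x terms matrix
phi_topicmodels <- posterior(ldatopicmodels)$terms  #topics x terms matrix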
#show top 15 words
lda_model$get_top_words(n = 15, topic_number = c(1:10), lambda = 0.3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "finally" "men" "know" "video" "10" "king" "five" "our" "child" "cop"
# [2,] "re" "always" "ve" "1" "doesn" "match" "atmosphere" "husband" "later" "themselves"
# [3,] "three" "lost" "got" "head" "zombie" "lee" "mr" "comedy" "parents" "mary"
# [4,] "m" "team" "say" "girls" "message" "song" "de" "seem" "sexual" "average"
# [5,] "gay" "here" "d" "camera" "start" "musical" "may" "man" "murder" "scenes"
# [6,] "kids" "within" "funny" "kill" "3" "four" "especially" "problem" "tale" "police"
# [7,] "sort" "score" "want" "stupid" "zombies" "dance" "quality" "friends" "television" "appears"
# [8,] "few" "thriller" "movies" "talking" "movies" "action" "public" "given" "okay" "trying"
# [9,] "bit" "surprise" "let" "hard" "ask" "fun" "events" "crime" "cover" "waiting"
# [10,] "hot" "own" "thinking" "horrible" "won" "tony" "u" "special" "stan" "lewis"
# [11,] "die" "political" "nice" "stay" "open" "twist" "kelly" "through" "uses" "imdb"
# [12,] "credits" "success" "never" "back" "davis" "killer" "novel" "world" "order" "candy"
# [13,] "two" "does" "bunch" "didn" "completely" "ending" "copy" "show" "strange" "name"
# [14,] "otherwise" "beauty" "hilarious" "room" "love" "dancing" "japanese" "new" "female" "low"
# [15,] "need" "brilliant" "lot" "minutes" "away" "convincing" "far" "mostly" "girl" "killing"
terms(ldatopicmodels, 15)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] "show" "where" "horror" "did" "life" "such" "m" "films" "man" "seen"
# [2,] "years" "minutes" "pretty" "10" "young" "character" "something" "music" "new" "movies"
# [3,] "old" "gets" "best" "now" "through" "while" "re" "actors" "two" "plot"
# [4,] "every" "guy" "ending" "why" "love" "those" "going" "role" "though" "better"
# [5,] "series" "another" "bit" "saw" "woman" "does" "things" "performance" "big" "worst"
# [6,] "funny" "around" "quite" "didn" "us" "seems" "want" "between" "back" "interesting"
# [7,] "comedy" "nothing" "little" "say" "real" "book" "thing" "love" "action" "your"
# [8,] "again" "down" "actually" "thought" "our" "may" "know" "play" "shot" "money"
# [9,] "tv" "take" "house" "still" "war" "work" "ve" "line" "together" "hard"
# [10,] "watching" "these" "however" "end" "father" "far" "here" "actor" "against" "poor"
# [11,] "cast" "fun" "cast" "got" "find" "scenes" "doesn" "star" "title" "least"
# [12,] "long" "night" "entertaining" "2" "human" "both" "look" "never" "go" "say"
# [13,] "through" "scene" "must" "am" "shows" "yet" "isn" "played" "city" "director"
# [14,] "once" "back" "each" "done" "family" "audience" "anything" "hollywood" "came" "probably"
# [15,] "watched" "dead" "makes" "3" "mother" "almost" "enough" "always" "match" "video"
#UPDATE
#number of terms in each model is the same
length(ldatopicmodels@terms)
# [1] 2170
nrow(vocab)
# [1] 2170
#number of NA entries for termlist of first topic differs
sum(is.na(
lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[,1]
)
)
#[1] 1778
sum(is.na(
terms(ldatopicmodels, length(ldatopicmodels@terms))
)
)
#[1] 0
#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(setdiff(i[!is.na(i)],j[!is.na(j)]))
})
})
}
#apply the check
termstopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))
termstext2vec <- lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)
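#added check (not in the original run): count how many non-NA terms each topic retains,
#making the "text2vec discards several terms per topic" observation visible per topic
colSums(!is.na(termstext2vec))
colSums(!is.na(termstopicmodels)) #topicmodels should retain all 2170 terms in every topic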
lengthsetdiff(termstopicmodels,
termstopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
lengthsetdiff(termstext2vec,
termstext2vec)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 340 318 335 292 309 320 355 294 322
# [2,] 355 0 321 343 292 319 311 346 302 339
# [3,] 350 338 0 316 286 309 311 358 318 322
# [4,] 346 339 295 0 297 310 301 335 309 332
# [5,] 345 330 307 339 0 310 310 354 309 333
# [6,] 350 345 318 340 298 0 311 342 308 325
# [7,] 366 342 325 336 303 316 0 364 311 325
# [8,] 355 331 326 324 301 301 318 0 311 335
# [9,] 336 329 328 340 298 309 307 353 0 314
# [10,] 342 344 310 341 300 304 299 355 292 0
lengthsetdiff(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [2,] 1793 1793 1793 1793 1793 1793 1793 1793 1793 1793
# [3,] 1810 1810 1810 1810 1810 1810 1810 1810 1810 1810
# [4,] 1789 1789 1789 1789 1789 1789 1789 1789 1789 1789
# [5,] 1831 1831 1831 1831 1831 1831 1831 1831 1831 1831
# [6,] 1819 1819 1819 1819 1819 1819 1819 1819 1819 1819
# [7,] 1824 1824 1824 1824 1824 1824 1824 1824 1824 1824
# [8,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [9,] 1820 1820 1820 1820 1820 1820 1820 1820 1820 1820
# [10,] 1798 1798 1798 1798 1798 1798 1798 1798 1798 1798
lengthsetdiff(termstext2vec,
termstopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(intersect(i[!is.na(i)], j[!is.na(j)]))
})
})
}
lengthintersect(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 392 392 392 392 392 392 392 392 392 392
# [2,] 377 377 377 377 377 377 377 377 377 377
# [3,] 360 360 360 360 360 360 360 360 360 360
# [4,] 381 381 381 381 381 381 381 381 381 381
# [5,] 339 339 339 339 339 339 339 339 339 339
# [6,] 351 351 351 351 351 351 351 351 351 351
# [7,] 346 346 346 346 346 346 346 346 346 346
# [8,] 392 392 392 392 392 392 392 392 392 392
# [9,] 350 350 350 350 350 350 350 350 350 350
# [10,] 372 372 372 372 372 372 372 372 372 372
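#added sketch (not in the original post): one way to match topics across the two packages is a
#Jaccard similarity on the top-N terms per topic; high values indicate candidate topic pairs
top_n <- 15
jaccard_topics <- function(x, y) {
  apply(x, 2, function(i) {
    apply(y, 2, function(j) {
      length(intersect(i, j)) / length(union(i, j))
    })
  })
}
round(jaccard_topics(terms(ldatopicmodels, top_n),
                     lda_model$get_top_words(n = top_n, topic_number = 1:10, lambda = 1)), 2)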