I've been unable to create reproducible results from topicmodels' LDA function. To take an example from their documentation:

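library(topicmodels)
data("AssociatedPress", package = "topicmodels")  # example document-term matrix shipped with the package
set.seed(0)
lda1 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
set.seed(0)
lda2 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
identical(lda1, lda2)
# [1] FALSE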

How can I get identical results from two separate calls to LDA?

As an aside (in case the package authors are on here), I find having to pass control=list(seed=0) unfortunate and unnecessary. Behind the scenes, there's a line that reads if (missing(seed)) seed <- as.integer(Sys.time()). This doesn't make the process more reliably random; it only overrides a seed already set with set.seed(). Am I missing something?

UPDATE: As @hrbrmstr discovered below, passing a seed through control produces effectively identical objects; the only difference is a temporary local file location. So this question is more of a misunderstanding (though it still seems like it would be clearer if the function respected set.seed()).

Max Ghenis
  • `seed` is mentioned in conjunction with `nstart` and `best` on page 9 of the [topicmodels jstats journal entry](http://epub.wu.ac.at/3987/1/topicmodels.pdf) [PDF]. I think you may need to ensure all of those are in the `control=list(…)` parameter set to get fully reproducible results. – hrbrmstr Mar 25 '14 at 00:40
  • Adding nstart=1 and best=T to the control still didn't get identical(lda1, lda2)==T – Max Ghenis Mar 25 '14 at 00:46

1 Answer

Not really an "answer" but there's no other way to post code snippets :-)

I gave the following a go:

library(topicmodels)

data(AssociatedPress)

lda1 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)
lda2 <- LDA(AssociatedPress[1:20, ], control=list(seed=0), k=2)

identical(lda1, lda2)
[1] FALSE

all.equal(lda1, lda2)
[1] "Attributes: < Component 5: Attributes: < Component 10: 1 string mismatch > >"

a1 <- posterior(lda1, AssociatedPress)
a2 <- posterior(lda2, AssociatedPress)

identical(a1, a2)
[1] TRUE

all.equal(a1, a2)
[1] TRUE

all.equal(lda1@alpha,lda2@alpha)
[1] TRUE
all.equal(lda1@call,lda2@call)
[1] TRUE
all.equal(lda1@Dim,lda2@Dim)
[1] TRUE
all.equal(lda1@control,lda2@control)
[1] "Attributes: < Component 10: 1 string mismatch >"
all.equal(lda1@k,lda2@k)
[1] TRUE
all.equal(lda1@terms,lda2@terms)
[1] TRUE
all.equal(lda1@documents,lda2@documents)
[1] TRUE
all.equal(lda1@beta,lda2@beta)
[1] TRUE
all.equal(lda1@gamma,lda2@gamma)
[1] TRUE
all.equal(lda1@wordassignments,lda2@wordassignments)
[1] TRUE
all.equal(lda1@loglikelihood,lda2@loglikelihood)
[1] TRUE
all.equal(lda1@iter,lda2@iter)
[1] TRUE
all.equal(lda1@logLiks,lda2@logLiks)
[1] TRUE
all.equal(lda1@n,lda2@n)
[1] TRUE

identical(lda1@alpha,lda2@alpha)
[1] TRUE
identical(lda1@call,lda2@call)
[1] TRUE
identical(lda1@Dim,lda2@Dim)
[1] TRUE
identical(lda1@control,lda2@control)
[1] FALSE
identical(lda1@k,lda2@k)
[1] TRUE
identical(lda1@terms,lda2@terms)
[1] TRUE
identical(lda1@documents,lda2@documents)
[1] TRUE
identical(lda1@beta,lda2@beta)
[1] TRUE
identical(lda1@gamma,lda2@gamma)
[1] TRUE
identical(lda1@wordassignments,lda2@wordassignments)
[1] TRUE
identical(lda1@loglikelihood,lda2@loglikelihood)
[1] TRUE
identical(lda1@iter,lda2@iter)
[1] TRUE
identical(lda1@logLiks,lda2@logLiks)
[1] TRUE
identical(lda1@n,lda2@n)
[1] TRUE

Is the "unequal" @control significant?
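
The slot-by-slot checks above can also be done in a loop. What follows is only a minimal sketch: diff_slots is a hypothetical helper written for this post (it is not part of topicmodels), and ToyFit / ToyControl are made-up S4 classes standing in for a fitted model so the snippet runs without the package. It walks two S4 objects recursively and returns the paths of the slots that are not identical.

```r
library(methods)

# Hypothetical helper (not part of topicmodels): recursively walk two S4
# objects and return the paths of the slots that are not identical().
diff_slots <- function(a, b, path = "") {
  if (identical(a, b)) return(character(0))
  # Non-S4 values, or objects of different classes, are reported as a whole.
  if (!isS4(a) || !isS4(b) || !identical(class(a), class(b))) return(path)
  out <- character(0)
  for (s in slotNames(a)) {
    out <- c(out, diff_slots(slot(a, s), slot(b, s), paste0(path, "@", s)))
  }
  # If no individual slot differs, report the object itself.
  if (length(out) == 0) path else out
}

# Toy stand-ins (assumptions for illustration only): a "model" whose control
# object stores a temp-file prefix, which changes on every call to tempfile().
setClass("ToyControl", representation(seed = "integer", prefix = "character"))
setClass("ToyFit", representation(beta = "matrix", control = "ToyControl"))

make_fit <- function() {
  new("ToyFit",
      beta = diag(2),
      control = new("ToyControl", seed = 0L, prefix = tempfile()))
}

fit1 <- make_fit()
fit2 <- make_fit()

diffs <- diff_slots(fit1, fit2)
print(diffs)
```

Assuming the fitted models are ordinary S4 objects, as the @ accessors above suggest, diff_slots(lda1, lda2) should work the same way and point at whichever nested slot is responsible for identical() returning FALSE.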

hrbrmstr
  • Ah indeed, and within the `@control` slot it's only the `@prefix` slot that differs. This holds the tmp folder on the computer, presumably where intermediate data steps are performed. Thanks, marking as an answer and will edit the question accordingly. – Max Ghenis Mar 25 '14 at 01:35
  • Nice! I've been bit with other complex data structures and `identical` before so I hoped some further dissection would yield some fruit. Glad it helped. – hrbrmstr Mar 25 '14 at 01:36