
Step 1

I'm using R and the "topicmodels" package to build an LDA model from a corpus of 4.5k documents. I do the usual pre-processing steps (stopwords, cutting low/high word frequencies, lemmatization) and end up with a 100-topic model that I'm happy with. In fact, it's an almost perfect model for my needs.

justlda <- LDA(k=100, x=dtm_lemma, method="Gibbs", control=control_list_gibbs)
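For reference, the pre-processing is roughly along these lines (a simplified sketch using the tm package; names like raw_texts and the Gibbs control values are placeholders, not my actual code):

library(tm)
library(topicmodels)

# raw_texts: hypothetical character vector holding the 4.5k documents
corpus <- VCorpus(VectorSource(raw_texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
# lemmatization happens here via an external tagger; omitted in this sketch

dtm_lemma <- DocumentTermMatrix(corpus)
dtm_lemma <- removeSparseTerms(dtm_lemma, 0.99)  # drop very rare terms

# assumed Gibbs settings, stand-ins for my actual control_list_gibbs
control_list_gibbs <- list(seed = 1234, burnin = 1000, iter = 2000, thin = 100)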

Step 2

I then pre-process a new corpus of 300 documents (unseen by the model) using the exact same process as above, transform it into a document-term matrix, and use the "posterior" function of the same package to predict the topics on the new data. This corpus comes from the same authors and is very similar to the training set.

My problem

The predictions (posterior probabilities) I get are totally wrong. This is the code I'm using to get the posterior:

topics = posterior(justlda, dtm_lemma, control = control_list_gibbs)$topics
  • justlda is the model built with the whole corpus in step 1.
  • dtm_lemma is the pre-processed document-term matrix of the new data.
  • control is the list of LDA parameters (the same for both).

I feel that not only are the predictions wrong, the topic weights are also very low. Nothing comes out as a dominant topic. (For this 100-topic model, most topics come out around 0.08, and I'm lucky to get a 0.20 weight that isn't even relevant...)

I have less than a year of experience with NLP/LDA and the R language. I feel I could be making a very amateur mistake somewhere that would explain the wrong predictions.

Is this kind of result normal? What could I possibly be doing wrong?

JFB

1 Answer


I'm not 100% sure what you mean by 'wrong', so I did a quick test to see if posterior works on new data. First I fit a model on all documents of the AssociatedPress dataset:

library(topicmodels)
data("AssociatedPress")
ap_lda <- LDA(AssociatedPress, k = 5, control = list(seed = 1234))

From your question, I suspect you are looking at the most likely topic for each document. To keep things comparable, I compute these myself here, using some tidyverse packages:

library(tidytext)
library(dplyr)
library(tidyr)
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents %>% 
  group_by(document) %>% 
  top_n(1, gamma) %>% # keep only most likely topic
  arrange(document)
# A tibble: 2,246 x 3
# Groups:   document [2,246]
   document topic gamma
      <int> <int> <dbl>
 1        1     4 0.999
 2        2     2 0.529
 3        3     4 0.999
 4        4     4 0.518
 5        5     4 0.995
 6        6     2 0.971
 7        7     1 0.728
 8        8     2 0.941
 9        9     4 0.477
10       10     5 0.500
# ... with 2,236 more rows

Now I run the same LDA again but withhold the first 10 documents:

AssociatedPress_train <- AssociatedPress[11:nrow(AssociatedPress), ]
AssociatedPress_test <- AssociatedPress[1:10, ]

ap_lda <- LDA(AssociatedPress_train, k = 5, control = list(seed = 1234))

I use posterior to get the gamma values for each document and again keep only the most likely topic:

posterior(object = ap_lda, newdata = AssociatedPress_test)$topics %>%
  as_tibble() %>% 
  mutate(document = seq_len(nrow(.))) %>% 
  gather(topic, gamma, -document) %>% 
  group_by(document) %>% 
  top_n(1, gamma) %>% # keep only most probable topic
  arrange(document)
# A tibble: 10 x 3
# Groups:   document [10]
   document topic gamma
      <int> <chr> <dbl>
 1        1 4     0.898
 2        2 2     0.497
 3        3 4     0.896
 4        4 4     0.468
 5        5 4     0.870
 6        6 2     0.754
 7        7 1     0.509
 8        8 2     0.913
 9        9 4     0.476
10       10 2     0.399

All but document 10 get the same most likely topic as before, so posterior seems to work fine on new data, and I don't see an immediate problem with your code.

One thing I haven't tested is what happens if the DTM of training and test set have different columns. I suspect that would be a problem.
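A quick way to check whether that is happening in your case (the object names below are placeholders for your training and new-data DTMs):

identical(dtm_lemma_train$dimnames$Terms, dtm_lemma_new$dimnames$Terms)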

Here is a quick example of how you could deal with that:

text1 <- tibble(doc = 1, word = LETTERS[1:10])
text2 <- tibble(doc = 1, word = LETTERS[2:11])
dtm1 <- text1 %>%
  count(doc, word) %>%
  arrange(word) %>%
  cast_dtm(doc, word, n)

dtm2 <- text2 %>%
  count(doc, word) %>%
  arrange(word) %>%
  cast_dtm(doc, word, n)

all.equal(dtm1$dimnames$Terms, dtm2$dimnames$Terms)
[1] "10 string mismatches"

I make two DTMs where the second one has an extra term and lacks one term from the first, so their dimnames differ. We can make them equal by bringing the second DTM back into a tidy format, removing the extra term, and adding the missing terms with zero counts before casting the DTM again:

dtm2_clean <- tidy(dtm2) %>% 
  filter(term %in% dtm1$dimnames$Terms) %>% 
  rbind(tibble(document = 1, 
               term = dtm1$dimnames$Terms, # adding term but no counts
               count = 0)) %>% 
  arrange(term) %>% 
  cast_dtm(document, term, count)

all.equal(dtm1$dimnames$Terms, dtm2_clean$dimnames$Terms)
[1] TRUE

You can now use this as newdata for posterior.
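Transferred to your setup, the same idea would look roughly like this; dtm_lemma_train and dtm_lemma_new are hypothetical names for your training and new-data DTMs, and justlda is your fitted model:

new_tidy <- tidy(dtm_lemma_new)
dtm_new_aligned <- new_tidy %>%
  filter(term %in% dtm_lemma_train$dimnames$Terms) %>%   # drop terms unseen in training
  rbind(tibble(document = new_tidy$document[1],
               term = dtm_lemma_train$dimnames$Terms,    # make sure every training term exists
               count = 0)) %>%
  arrange(term) %>%
  cast_dtm(document, term, count)

new_topics <- posterior(justlda, newdata = dtm_new_aligned)$topics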

JBGruber
  • Thank you for your answer. You're confirming that I'm going at it the right way. And from your last line, I'm starting to feel it's the pre-processing code that could be wrong. I've added all of it in the question. – JFB Mar 18 '19 at 02:27
  • I think the goal has quite changed from your original question. You are new here, so you might not understand that this is considered bad style. I've already put in work and don't want to completely rethink your problem (I also don't have the time). I added some code for what I think is the most likely problem you have (differing terms in training and test DTM). Also, note that I arrange the terms before casting the DTM so they are both in alphabetical order. – JBGruber Mar 18 '19 at 10:32
  • The DTM not being created equal definitely seems to explain my problem. I accepted your answer. And sorry for moving the goalpost of the question, I'll keep that in mind moving forward. They should add a button to send a bottle of wine to the helper. You deserve it! – JFB Mar 18 '19 at 17:03
  • Haha don't worry about it. Glad I could help! – JBGruber Mar 18 '19 at 18:33