
I have a question about the "augment" function from Silge and Robinson's textbook "Text Mining with R: A Tidy Approach". Having run an LDA on a corpus, I am applying "augment" to assign a topic to each word.

I get the results, but I am not sure what takes place "under the hood" in "augment", i.e. how the topic for each word is determined within the Bayesian framework. Is it just based on the conditional probability formula, estimated after the LDA is fit, using p(topic|word) = p(word|topic) * p(topic) / p(word)?

I would appreciate it if someone could provide statistical details on how "augment" does this. Could you also please provide references to papers where this is documented?

James Z
Dave
  • Is your question about augment() or about LDA? Augment is simply "tidying" the output from the model fit. If your question is about the details surrounding topic modeling using LDA, I believe stats.stackexchange is a better place. – Henry Cyranka Nov 16 '18 at 15:30
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. `augment()` is a generic function from `broom`. It does different things depending on what you pass in, and it is not at all specific to topic modeling. You need to check the documentation for the other functions you are using to determine what method is used during modeling. – MrFlick Nov 16 '18 at 16:15
  • @Henry Cyranka, my question is about how the "augment" function determines the topic for each word, so it is not about LDA per se. Apparently, as part of "tidying" the output, it returns topics. I need to understand how it gets those topics. – Dave Nov 16 '18 at 17:50
  • The augment function does not assign words to any topic. As @MrFlick commented, it is a generic function from broom. Documents and words are assigned to topics by the function LDA(). Augment simply matches the word assignments created by LDA to each word in the tidy data frame. – Henry Cyranka Nov 16 '18 at 17:56

1 Answer


The tidytext package is open source and on GitHub, so you can dig into the code for augment() yourself. I'd suggest looking at

  • augment() for LDA from the topicmodels package
  • augment() for the structural topic model from the stm package

To learn more about these approaches, there is an excellent paper/vignette on the structural topic model, and I like the Wikipedia article for LDA.
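
As a rough sketch of what that code does for a topicmodels LDA fit: as far as I can tell, `augment()` does not run any new inference. It reads the `wordassignments` slot that `LDA()` already stored on the fitted model (described in the topicmodels documentation as the most probable topic for each observed word in each document) and joins it back onto the document-term counts. The example below assumes the topicmodels and tidytext packages are installed and uses the `AssociatedPress` data shipped with topicmodels; `k = 2` and the seed are arbitrary choices for illustration.

```r
library(topicmodels)
library(tidytext)

# Example document-term matrix that ships with topicmodels
data("AssociatedPress", package = "topicmodels")

# Fit a two-topic LDA model (k and seed are illustrative choices)
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# Per-document, per-term topic assignments stored by LDA() itself:
# a sparse matrix with i = document index, j = term index, v = topic
wa <- ap_lda@wordassignments
from_slot <- data.frame(
  document = wa$i,
  term     = ap_lda@terms[wa$j],
  topic    = wa$v
)
head(from_slot)

# augment() returns one row per document-term pair with a .topic column
assignments <- augment(ap_lda, data = AssociatedPress)
head(assignments)
```

Comparing `from_slot` with `assignments` for a few document-term pairs is a quick way to check that the `.topic` column comes from the fitted model rather than from a separate Bayes-rule calculation inside `augment()`.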

Julia Silge