
I just read a fascinating article about how MALLET could be used for topic modelling, but I couldn't find anything online comparing MALLET to NLTK, which I've already had some experience with.

What are the main differences between them? Is MALLET a more 'complete' resource (e.g. does it have more tools and algorithms under the hood)? And are there any good articles that answer these first two questions?

Trindaz
  • Can't answer that, but the NLTK includes a mallet interface so you can try them out in tandem. – alexis Apr 09 '12 at 22:36
  • If you're already familiar with Python, just use "gensim, topic modelling for humans". – Radim Mar 21 '14 at 10:20
  • @Radim ;P yes, `gensim` is one of the most user-friendly topic modelling modules I've personally used/seen for Python. It should have been "gensim, topic modelling for mere mortals" =) – alvas Apr 09 '14 at 07:54
  • As far as I can tell, nltk doesn't have a topic modelling implementation, at least not a Latent Dirichlet Allocation model (which is what Mallet provides). I'd be interested to know if I'm wrong :) – drevicko Feb 08 '16 at 17:47

3 Answers


It's not that one is more complete than the other; it's more that each has some things the other doesn't, and vice versa. It's also a question of intended audience and purpose.

Mallet is a Java-based machine learning toolkit that aims to provide robust and fast implementations of various natural language processing tasks.

NLTK is built in Python and comes with a lot of extras, such as bundled corpora like WordNet. NLTK is aimed more at people learning NLP, and as such is used more as a learning platform and perhaps less as an engineering solution.
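To give a flavour of those bundled resources, here is a minimal sketch of querying WordNet through NLTK (method names as in NLTK 3.x; the example word is arbitrary):

```python
import nltk
from nltk.corpus import wordnet as wn

# One-off download of the bundled WordNet corpus (skip if already installed)
nltk.download('wordnet')

# Each synset is one sense of the word, e.g. Synset('bank.n.01')
for synset in wn.synsets('bank'):
    print(synset, synset.definition())
```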

In my opinion the main difference between the two is that NLTK is better positioned as a learning resource for people interested in machine learning and NLP, as it comes with extensive documentation, examples, corpora and so on.

Mallet is aimed more at researchers and practitioners who work in the field and already know what they want to do. It comes with less documentation (although it has good examples and the API is well documented) compared to NLTK's extensive collection of general NLP material.

UPDATE: Good articles describing these would be the Mallet docs and examples at http://mallet.cs.umass.edu/ (the sidebar has links to sequence tagging, topic modelling, etc.), and for NLTK the book Natural Language Processing with Python, which is a good introduction both to NLTK and to NLP.

UPDATE

I've recently found the scikit-learn (sklearn) Python library. It is aimed at machine learning more generally, not at NLP directly, but it can be used for NLP as well. It comes with a very large selection of modelling tools, most of which rely on NumPy, so it should be pretty fast. I've used it quite a bit and can say that it is very well written and documented, and has an active developer community pushing it forward (as of May 2013 at least).
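As a rough illustration, here is a minimal sketch of topic modelling with scikit-learn. It assumes the `CountVectorizer` and `LatentDirichletAllocation` classes and recent method names such as `get_feature_names_out` (available in current scikit-learn versions, not necessarily in the one referred to above); the toy documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell sharply today",
    "investors worry about the falling market",
]

# Bag-of-words counts, then a 2-topic LDA model
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # document-topic proportions

# Print the top words for each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = weights.argsort()[-5:][::-1]
    print("topic", topic_idx, [terms[i] for i in top])
```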

UPDATE 2

I've now also been using Mallet for some time (specifically the Mallet API) and can say that if you're planning on integrating Mallet into another project, you should be very familiar with Java and be ready to spend a lot of time debugging an almost completely undocumented code base.

If all you want to do is use the Mallet command-line tools, that's fine; using the API, however, requires a lot of digging through the Mallet code itself, and usually fixing some bugs as well. Be warned: Mallet comes with minimal documentation for its API.
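If the command-line route is enough, here is a minimal sketch of driving the usual import-then-train pipeline from Python via `subprocess`. The `import-dir` and `train-topics` commands and their options follow the Mallet topic-modelling tutorial; the Mallet path, input directory and output file names are assumptions:

```python
import subprocess

MALLET = "/path/to/mallet/bin/mallet"  # assumption: path to a local Mallet install

# Step 1: import a directory of plain-text files into Mallet's binary format
subprocess.check_call([
    MALLET, "import-dir",
    "--input", "docs/",
    "--output", "topic-input.mallet",
    "--keep-sequence",
    "--remove-stopwords",
])

# Step 2: train an LDA model and write out the top words and document-topic mix
subprocess.check_call([
    MALLET, "train-topics",
    "--input", "topic-input.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic-keys.txt",
    "--output-doc-topics", "doc-topics.txt",
])
```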

Matti Lyra

The question is whether you're working in Python or Java (or neither). Mallet is good for Java (and therefore Clojure and Scala), since you can easily access its API from Java. Mallet also has a nice command-line interface, so you can use it outside of an application.

For the same reason, NLTK is great if you're working in Python, and you won't have to do any Jython craziness to get the two to play well together. If you're using Python, Gensim has just added a Mallet wrapper that is worth checking out. Right now it's basically a bare-bones alpha feature, but it may do what you need.
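A minimal sketch of what using that wrapper looks like, assuming a gensim version that still ships `gensim.models.wrappers.LdaMallet` (newer releases may not) and a local Mallet install; the toy corpus and paths are made up:

```python
from gensim import corpora
from gensim.models.wrappers import LdaMallet  # assumption: available in older gensim releases

docs = [
    ["human", "machine", "interface", "computer"],
    ["graph", "trees", "minors", "survey"],
    ["user", "interface", "system", "computer"],
]

# Build the dictionary and bag-of-words corpus gensim expects
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Path to a local Mallet install is an assumption
mallet_path = "/path/to/mallet/bin/mallet"
lda = LdaMallet(mallet_path, corpus=corpus, num_topics=2, id2word=dictionary)

print(lda.show_topics(num_topics=2, num_words=4))
```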

theclaymethod
  • If you're using gensim you might as well go with the online LDA version in `gensim.models.ldamodel.LdaModel` (sketched after these comments) instead of the mallet one, unless you really, really want to use the Gibbs sampling variety that mallet implements. – Matti Lyra Feb 05 '15 at 09:23
  • @MattiLyra well... the mallet LDA implementations do parameter optimisation. In most cases that profoundly improves the quality of the learned model. If you don't need an online algorithm (i.e. you don't need to keep adding documents and refining the model), I'd go for mallet. I've not compared convergence speeds, but mallet's also got a multicore estimator, which speeds things up a lot if you've got the hardware. – drevicko Apr 15 '15 at 06:15
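For reference, a minimal sketch of the online LDA class mentioned above (`gensim.models.ldamodel.LdaModel`); the toy corpus and parameter values are made up:

```python
from gensim import corpora
from gensim.models.ldamodel import LdaModel

docs = [
    ["human", "machine", "interface", "computer"],
    ["graph", "trees", "minors", "survey"],
    ["user", "interface", "system", "computer"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Online variational Bayes LDA; update() can later be called with new documents
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics(num_topics=2, num_words=4))
```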

I'm not familiar with NLTK's topic modeling toolkit, so I won't try to compare it. The Mallet sources on GitHub contain several algorithms (some of which are not available in the 'released' version). To my knowledge, these include:

  • SimpleLDA (LDA with collapsed Gibbs sampling)
  • ParallelTopicModel (LDA that works on multi-core)
  • HierarchicalLDA
  • LabeledLDA (a semi-supervised approach to LDA)
  • Pachinko Allocation with LDA
  • WeightedTopicModel

It also has

  • a couple of classes that help in the diagnosis of LDA models (TopicModelDiagnostics.java)
  • the ability to serialize and de-serialize a trained LDA model

All in all, it is a fine toolkit for experimenting with topic models, with an approachable open-source license (CPL).

shark8me