6

I have been using MALLET to infer topics from a text file containing 100,000 lines (around 34 MB in MALLET format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way of splitting the file into smaller ones and building a model from the data in all the files combined? Thanks in advance.

fayaz

5 Answers

6

In bin/mallet.bat, increase the value on this line:

set MALLET_MEMORY=1G
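
For example, to give MALLET a 4 GB heap (the 4G value is just an illustration; use whatever your machine can spare):

set MALLET_MEMORY=4G

On Linux or macOS the equivalent knob is in the bin/mallet shell script, which, if I remember correctly, defines a MEMORY variable that the launcher passes to the JVM as -Xmx:

MEMORY=4g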
metdos
1

I'm not sure about the scalability of MALLET to big data, but the Dragon Toolkit (http://dragon.ischool.drexel.edu/) can store its data in disk-backed persistence and can therefore scale to unlimited corpus sizes (with low performance, of course).

yura
    It looks like the Dragon Toolkit is dead though. There hasn't been any activity since 2007. Moreover, it's not clear what license it uses (commercial development permissible?) – chaostheory May 18 '11 at 14:00
1

The model is still going to be huge, even if it is read in from multiple files. Have you tried increasing the heap size of your Java VM?

Turnsole
1

A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space. You can use -Xms and -Xmx to set the heap size so that the error does not occur again.
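
For example, if you call the JVM directly instead of going through the bin/mallet launcher, you can pass the heap flags on the command line. The class name and file names below are only illustrative (cc.mallet.topics.tui.Vectors2Topics is the topic-training entry point in MALLET 2.x, as far as I know); adapt them to your setup:

java -Xms1g -Xmx4g -cp "mallet.jar:lib/mallet-deps.jar" cc.mallet.topics.tui.Vectors2Topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

-Xms sets the initial heap and -Xmx the maximum; the maximum is what matters for avoiding the OutOfMemoryError.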

Kiran M
0

Given current PC memory sizes, it should be easy to use a heap as large as 2 GB. Try the single-machine solution before considering a cluster.

Leo5188