6

I have been using MALLET to infer topics from a text file containing 100,000 lines (around 34 MB in MALLET format). Now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way of splitting the file into smaller ones and building a model from the data in all the files combined? Thanks in advance.

fayaz

5 Answers

6

In bin/mallet.bat, increase the value on this line:

set MALLET_MEMORY=1G
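
For example, to give MALLET a 4 GB heap (the 4G value is just an illustration; use whatever your machine can spare):

set MALLET_MEMORY=4G

On Linux or macOS the equivalent knob is in the bin/mallet shell script, which, if I remember correctly, defines a MEMORY variable that the launcher passes to the JVM as -Xmx:

MEMORY=4g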
metdos
1

I'm not sure about the scalability of MALLET to big data, but the Dragon Toolkit (http://dragon.ischool.drexel.edu/) can store its data in disk-backed persistence and can therefore scale to unlimited corpus sizes (with low performance, of course).

yura
    It looks like the Dragon Toolkit is dead though. There hasn't been any activity since 2007. Moreover, it's not clear what license it uses (commercial development permissible?) – chaostheory May 18 '11 at 14:00
1

The model is still going to be huge, even if it is read in from multiple files. Have you tried increasing the heap size of your Java VM?

Turnsole
1

A java.lang.OutOfMemoryError occurs mainly because of insufficient heap space. You can use -Xms and -Xmx to set the heap size so that the error does not occur again.
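
For example, if you call the JVM directly instead of going through the bin/mallet launcher, you can pass the heap flags on the command line. The class name and file names below are only illustrative (cc.mallet.topics.tui.Vectors2Topics is the topic-training entry point in MALLET 2.x, as far as I know); adapt them to your setup:

java -Xms1g -Xmx4g -cp "mallet.jar:lib/mallet-deps.jar" cc.mallet.topics.tui.Vectors2Topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

-Xms sets the initial heap and -Xmx the maximum; the maximum is what matters for avoiding the OutOfMemoryError.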

Kiran M
0

Given current PC memory sizes, it should be easy to use a heap as large as 2 GB. Try the single-machine solution before considering a cluster.

Leo5188