
I have large datasets of 2-3 GB. I am using the NLTK Naive Bayes classifier, with this data as the training data. When I run the code on small datasets it works fine, but on the large datasets it runs for a very long time (more than 8 hours) and then crashes without much of an error message. I believe this is a memory issue.
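For context, the training step follows the standard NLTK pattern, along these lines (a simplified sketch; extract_features and the tiny example list are placeholders, the real feature extraction reads from the 2-3 GB files):

    import nltk

    def extract_features(text):
        # placeholder: bag-of-words presence features
        return {word: True for word in text.split()}

    # in the real code this list is built from the 2-3 GB dataset
    labeled_texts = [("good movie", "pos"), ("bad movie", "neg")]

    train_set = [(extract_features(text), label) for text, label in labeled_texts]
    classifier = nltk.NaiveBayesClassifier.train(train_set)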

Also, after training the classifier I want to dump it to a file so that it can be used later on test data. This step also takes too much time and then crashes, since it loads everything into memory first.
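The dump step is plain pickle, roughly like this (a sketch continuing from the training code above; the filename is arbitrary):

    import pickle

    # save the trained classifier so it can be reloaded later for testing
    with open("classifier.pickle", "wb") as f:
        pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)

    # later, when testing
    with open("classifier.pickle", "rb") as f:
        classifier = pickle.load(f)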

Is there a way to resolve this?

Another question: is there a way to parallelize this whole operation, i.e. parallelize the classification of this large dataset using a framework like Hadoop/MapReduce?

asked by jigsaw
  • Without more context and specific info on your situation it's difficult to help/answer your question. – AtAFork Nov 06 '14 at 04:35

1 Answer


To overcome this problem you likely need to increase the memory available to the process, or manage memory so that the whole dataset is not held at once. This link may help: Python Memory Management
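One concrete way to avoid holding the whole dataset in memory is to switch from nltk.NaiveBayesClassifier to scikit-learn's MultinomialNB, which supports incremental training via partial_fit, combined with a HashingVectorizer so no vocabulary has to be kept in memory either. A rough sketch (read_chunks, the tab-separated file layout, and the label set are assumptions about your data):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)  # non-negative features for NB
    clf = MultinomialNB()
    classes = ["pos", "neg"]  # all labels must be known before the first partial_fit

    def read_chunks(path, chunk_size=10000):
        # yield (texts, labels) batches; assumes one "label<TAB>text" line per example
        texts, labels = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                label, _, text = line.partition("\t")
                labels.append(label)
                texts.append(text)
                if len(texts) == chunk_size:
                    yield texts, labels
                    texts, labels = [], []
        if texts:
            yield texts, labels

    for texts, labels in read_chunks("train.tsv"):
        X = vectorizer.transform(texts)  # hashing is stateless, no pass over the full corpus needed
        clf.partial_fit(X, labels, classes=classes)

This also eases the pickling problem, since the fitted MultinomialNB is just a few arrays and is quick to dump.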

For the parallelization question, see Parallelism in Python.
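Short of setting up Hadoop/MapReduce, note that the classification step (unlike training) is embarrassingly parallel, so the standard library's multiprocessing module can spread it over the cores of one machine. A sketch, assuming the classifier was pickled to classifier.pickle as in the question and reusing the same placeholder extract_features:

    import pickle
    from multiprocessing import Pool

    def extract_features(text):
        # same placeholder feature extractor as in the question
        return {word: True for word in text.split()}

    def classify_chunk(texts):
        # each worker process loads its own copy of the classifier
        with open("classifier.pickle", "rb") as f:
            classifier = pickle.load(f)
        return [classifier.classify(extract_features(t)) for t in texts]

    if __name__ == "__main__":
        test_texts = ["good movie", "bad movie", "great plot", "terrible acting"]
        n_workers = 4
        chunk_size = max(1, len(test_texts) // n_workers)
        # contiguous chunks keep the output in the original order
        chunks = [test_texts[i:i + chunk_size] for i in range(0, len(test_texts), chunk_size)]
        with Pool(n_workers) as pool:
            results = pool.map(classify_chunk, chunks)
        labels = [label for chunk in results for label in chunk]
        print(labels)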

answered by BelieveToLive