I have large datasets of 2-3 GB that I am using as training data for NLTK's Naive Bayes classifier. The code runs fine on small datasets, but on the large ones it runs for a very long time (more than 8 hours) and then crashes without much of an error message. I believe this is due to a memory issue.
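Roughly, the training part looks like this (a simplified sketch; `extract_features` is a hypothetical stand-in for my actual feature extraction, and the whole dataset is read into memory before training):

```
import nltk

# Hypothetical feature extractor -- the real one is more involved,
# but the training data has the same shape: (features, label) pairs.
def extract_features(text):
    return {word: True for word in text.split()}

# labeled_texts is built by loading the entire 2-3 GB dataset into memory
labeled_texts = [("some example text", "positive"),
                 ("another example text", "negative")]  # ... millions more

train_set = [(extract_features(text), label) for text, label in labeled_texts]
classifier = nltk.NaiveBayesClassifier.train(train_set)
```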
Also, after training on the data I want to dump the classifier to a file so that it can be reused later on test data. This process also takes too much time and then crashes, since it loads everything into memory first.
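The dumping and reloading is just plain pickling, something like this (the file name is only an example):

```
import pickle

# Dump the trained classifier so it can be reloaded later for test data.
with open("naive_bayes_classifier.pickle", "wb") as f:
    pickle.dump(classifier, f)

# Later, to classify test data:
with open("naive_bayes_classifier.pickle", "rb") as f:
    classifier = pickle.load(f)
print(classifier.classify(extract_features("some test text")))
```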
Is there a way to resolve this?
Another question: is there a way to parallelize this whole operation, i.e., to parallelize the classification of this large dataset using a framework like Hadoop/MapReduce?