
I want to build a tf-idf model based on a corpus that cannot fit in memory. I read the tutorial, but the corpus seems to be loaded all at once:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["doc1", "doc2", "doc3"]
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)

I wonder if I can load the documents into memory one by one instead of loading all of them.

user1387565
  • When working with large corpora, it might be a good idea to use a recent development version rather than a stable release, as `TfidfVectorizer` was overhauled for reduced memory usage and improved speed. – Fred Foo May 09 '13 at 20:20

1 Answer


Yes, you can: just make your corpus an iterable, e.g. a generator. For example, if your documents reside on disk, you can define a generator that takes a list of file names as an argument and yields the documents one by one, without loading everything into memory at once.

from sklearn.feature_extraction.text import TfidfVectorizer

def make_corpus(doc_files):
    # Yield one document at a time so the whole corpus
    # never has to sit in memory.
    for path in doc_files:
        with open(path, encoding="utf-8") as f:
            yield f.read()

file_list = ...  # list of files you want to load
corpus = make_corpus(file_list)
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)
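
One caveat worth adding: a generator is exhausted after a single pass, so if you also want to transform the same documents after fitting, build a fresh generator for each call. A minimal sketch, continuing from the snippet above; as an alternative, scikit-learn's vectorizers also accept input='filename', which makes the vectorizer open and read each file itself:

# A generator can only be consumed once, so create a new one
# for every call that iterates over the corpus.
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(make_corpus(file_list))
tfidf_matrix = vectorizer.transform(make_corpus(file_list))  # sparse tf-idf matrix

# Alternative: pass the file names directly. With input='filename',
# fit_transform takes a list of paths and reads one file at a time.
vectorizer = TfidfVectorizer(input="filename", min_df=1)
tfidf_matrix = vectorizer.fit_transform(file_list)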
Ando Saabas