
So I'm computing tf-idf for a very large corpus (100k documents) and it is giving me memory errors. Is there any implementation that works well with such a large number of documents? I also want to make my own stopwords list. It worked on 50k documents; what is the limit on the number of documents I can use in this calculation, if there is one (sklearn implementation)?

  def tf_idf(self, df):
    # Clean the raw text and get the corpus (a list of document strings)
    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    # transform() returns a sparse matrix of shape (n_documents, n_terms)
    article_vect = tfidf.transform(corpus)
    # .toarray() turns that into a dense (n_documents, n_terms) float64 array,
    # which matches the 65.3 GiB allocation in the error below
    tf_idf_DF = pd.DataFrame(data=article_vect.toarray(), columns=count_tokens)
    tf_idf_DF = pd.DataFrame(tf_idf_DF.sum(axis=0).sort_values(ascending=False))

    return tf_idf_DF

The error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64

Thanks in advance.

  • Which line do you get this on? – Nick ODell Feb 17 '22 at 00:10
  • See also: https://stackoverflow.com/questions/25145552/tfidf-for-large-dataset – Nick ODell Feb 17 '22 at 00:11
  • The limit isn't based on the number of documents. It's based on memory. Did you see that in the message? It's trying to allocate a 65GB array, and that's more memory than your system can allocate, even with a page file. You will have to use a smaller corpus. – Tim Roberts Feb 17 '22 at 00:16
  • @TimRoberts The problem is the memory, which can be avoided by reducing the number of documents. The link provided by Nick ODell shows other implementations, such as Gensim's, that solve this problem. The other question is also about 8 million documents, and it worked for them. –  Feb 17 '22 at 00:26
  • @FjkgB If it's only failing when the tf-idf result is converted into a dataframe, it's likely the conversion from a sparse array to a dense array that's taking most of the memory. [A sketch of avoiding that conversion follows these comments.] – Nick ODell Feb 17 '22 at 00:32
  • @NickODell makes sense, noted. Thank you for all your valuable answers and time! –  Feb 17 '22 at 00:36
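
Following up on the comment above: a minimal sketch of how the per-term scores could be summed without ever densifying the matrix (the corpus below is a placeholder for the asker's cleaned documents):

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder corpus; in the question this comes from self.CleanText(df)
    corpus = ["first cleaned document", "second cleaned document", "third cleaned document"]

    tfidf = TfidfVectorizer()
    article_vect = tfidf.fit_transform(corpus)  # stays a scipy.sparse matrix

    # Summing over axis 0 works directly on the sparse matrix and yields a
    # (1, n_terms) result, so the dense (n_docs, n_terms) array is never allocated.
    term_scores = np.asarray(article_vect.sum(axis=0)).ravel()

    tf_idf_DF = (pd.Series(term_scores, index=tfidf.get_feature_names_out())
                   .sort_values(ascending=False)
                   .to_frame())
    print(tf_idf_DF.head())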

1 Answer


TfidfVectorizer has a lot of parameters (see the TfidfVectorizer documentation). You should set max_df=0.9, min_df=0.1, and max_features=500, and grid-search these parameters for the best solution.

Without setting these parameters, you get a huge sparse matrix with shape (96671, 90622), which causes the memory error as soon as it is converted to a dense array.
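
A rough sketch of what that could look like (the corpus and stop-word list below are placeholders, and the parameter values are just the starting points suggested above, to be tuned with a grid search):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder inputs; substitute your own cleaned documents and stop words.
    corpus = ["first cleaned document", "second cleaned document", "third one"]
    my_stopwords = ["the", "and", "of"]

    vectorizer = TfidfVectorizer(
        max_df=0.9,              # ignore terms that appear in more than 90% of documents
        min_df=0.1,              # ignore terms that appear in fewer than 10% of documents
        max_features=500,        # keep only the 500 most frequent remaining terms
        stop_words=my_stopwords  # a custom stop-word list can be passed here
    )
    X = vectorizer.fit_transform(corpus)  # sparse matrix with at most 500 columns
    print(X.shape, len(vectorizer.get_feature_names_out()))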

Welcome to NLP!

prof_FL