45

I have a corpus of around 8 million news articles and I need to get their TF-IDF representation as a sparse matrix. I have been able to do that using scikit-learn for a relatively small number of samples, but I believe it can't be used for such a huge dataset, as it loads the input matrix into memory first and that's an expensive process.

Does anyone know what the best way would be to extract the TF-IDF vectors for large datasets?

apurva.nandan
  • Have you seen [this](http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html) and [this](http://scikit-learn.org/stable/modules/scaling_strategies.html)? – EdChum Aug 06 '14 at 07:15
  • @EdChum This method uses [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) for its features, which indeed is scalable and best for large datasets, but as you can see in the cons section, it cannot be used to compute IDF weights. – apurva.nandan Aug 06 '14 at 07:30
  • You'd have to do this iteratively if it can't all fit into memory, or store it in a DB/pytables and execute a query to calculate the tf-idf and store the result as another table – EdChum Aug 06 '14 at 07:33
  • I need to get a csr matrix. I think iterating will not preserve the shape of the matrix, since we would be feeding a new batch of data every time and getting matrices of different shapes; correct me if I am wrong. Also, how will the IDF weights be taken into account? They are global weights. Do you think using a Python generator would help, to first calculate the TF via CountVectorizer and then transform it for the IDF weights? – apurva.nandan Aug 06 '14 at 07:55
  • I believe you can use a `HashingVectorizer` to get a smallish `csr_matrix` out of your text data and then use a `TfidfTransformer` on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. – mbatchkarov Aug 06 '14 at 10:54
  • Have you considered using Mahout - https://mahout.apache.org/ - it's scalable for large datasets and can easily be used with Lucene. – nbz Aug 11 '14 at 12:02

4 Answers

31

Gensim has an efficient tf-idf model and does not need to have everything in memory at once.

Your corpus simply needs to be an iterable, so the whole corpus never has to be in memory at once.

According to the comments, the make_wiki script runs over Wikipedia in about 50 minutes on a laptop.
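For illustration, a minimal sketch of the streaming pattern gensim expects; the file list, the `tokenize` helper, and the output path are placeholder assumptions, not part of the original answer.

from gensim import corpora, models

paths = ["article1.txt", "article2.txt"]  # placeholder: your list of article files

def tokenize(path):
    # Naive whitespace tokenisation; swap in whatever preprocessing you use.
    with open(path) as f:
        return f.read().lower().split()

# First streaming pass: build the word <-> id mapping.
dictionary = corpora.Dictionary(tokenize(p) for p in paths)

# Iterable corpus: yields one bag-of-words vector per article, on demand.
class ArticleCorpus:
    def __iter__(self):
        for path in paths:
            yield dictionary.doc2bow(tokenize(path))

corpus = ArticleCorpus()
tfidf = models.TfidfModel(corpus)  # second pass: collects the document frequencies
corpora.MmCorpus.serialize("tfidf_corpus.mm", tfidf[corpus])  # stream the TF-IDF vectors to disk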

Venkatachalam
  • Using an iterable is the way to go actually. I finally used [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with an iterable over the corpus; a sketch of that approach is shown after these comments. – apurva.nandan Sep 10 '15 at 08:24
  • What do you mean by "iterable"? Could you explain that, please? – Ivan Bilan Feb 28 '16 at 00:14
  • @ivan_bilan Look up Python generators, that should explain it – apurva.nandan Apr 26 '16 at 08:10
  • gensim requires a bag of words to work on tf-idf? https://radimrehurek.com/gensim/tut2.html#available-transformations – InsParbo May 27 '18 at 20:02
  • @apurva.nandan Can you please explain how to perform tf-idf on a pandas dataframe using an iterable? – James Jul 18 '18 at 04:52
  • @James you can use `df.iterrows()` to iterate through the rows of a df – Matt Nov 20 '20 at 12:35
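For reference, a minimal sketch of the approach mentioned in the comments above: passing a generator to `TfidfVectorizer`. The file list and the `iter_articles` helper are placeholders, not part of the original thread. Note that the vocabulary, the IDF vector, and the resulting sparse matrix still have to fit in memory; only the raw documents are streamed.

from sklearn.feature_extraction.text import TfidfVectorizer

paths = ["article1.txt", "article2.txt"]  # placeholder: your list of article files

def iter_articles(paths):
    # Generator: yields one raw document at a time, so the full text of the
    # corpus never has to be held in memory at once.
    for path in paths:
        with open(path) as f:
            yield f.read()

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(iter_articles(paths))  # sparse csr_matrix of TF-IDF weights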
15

I believe you can use a `HashingVectorizer` to get a smallish `csr_matrix` out of your text data and then use a `TfidfTransformer` on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all: it could be the case that your system works reasonably well without it.
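A minimal sketch of that two-step approach, assuming the articles live in text files read by a small generator (the file list, the `iter_articles` helper, and the `n_features` value are illustrative, not part of the original answer, and `alternate_sign=False` assumes a recent scikit-learn):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

paths = ["article1.txt", "article2.txt"]  # placeholder: your list of article files

def iter_articles(paths):
    # Stream the raw documents from disk one at a time.
    for path in paths:
        with open(path) as f:
            yield f.read()

# HashingVectorizer is stateless: no vocabulary is built, so it scales to
# arbitrarily many documents. Keep the hashed counts non-negative so the
# IDF step behaves like it would with real term counts.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)
counts = hasher.transform(iter_articles(paths))  # sparse csr_matrix of hashed term counts

# TfidfTransformer then computes the global IDF weights from those counts.
tfidf_matrix = TfidfTransformer().fit_transform(counts)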

In practice you may have to subsample your data set: sometimes a system will do just as well by learning from only 10% of all available data. This is an empirical question; there is no way to tell in advance what strategy would be best for your task. I wouldn't worry about scaling to 8M documents until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upward trend).

Here is an example from something I was working on this morning: the performance of the system tends to improve as I add more documents, but it is already at a stage where more data seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.

mbatchkarov
  • I essentially want to calculate the nearest neighbours of all the articles (web page text, so the size is quite huge) once I have them. Would you suggest that I should still ignore the IDF weights and take only the token counts? By the way, how about splitting the data and running parallel jobs? But I guess merging the results (matrices with different shapes) would be cumbersome. I am visualizing them, so afterwards maybe I can check where to stop and how much data to take. – apurva.nandan Aug 06 '14 at 16:48
  • PS: I mentioned that holding the sparse term-document matrix is probably not going to be the problem. Jonathan's answer says the same thing: holding the text representation of all webpages is the hard part. However, you can get around that by streaming (using a generator) into a vectoriser. This is how `gensim` achieves such a tiny memory footprint even when processing very large corpora. Have a look at the tutorial here: http://radimrehurek.com/gensim/tut1.html – mbatchkarov Aug 07 '14 at 08:25
  • Yes, precisely; I had actually started using this approach before gensim came up in this thread. It saves a lot of memory using lazy evaluation and the feature hasher. Now the problem is storage: how do I store/dump such a large csr matrix? I am getting memory errors, any clues? – apurva.nandan Aug 07 '14 at 18:17
  • If you can hold it in memory, you shouldn't have a problem saving it to disk. What format are you using? Is a conversion to dense happening? – mbatchkarov Aug 08 '14 at 08:14
  • No, it's a csr matrix only; the problem was that I was also using bigrams and trigrams, which resulted in a lot of memory usage, and pickle couldn't dump such a large matrix. I started using the HDF5 format via pytables and it's all good now :) (a storage sketch follows these comments) – apurva.nandan Aug 09 '14 at 08:57
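As an illustration of that storage step: the CSR components (`data`, `indices`, `indptr`) plus the shape are enough to rebuild the matrix, so they can be written to HDF5 without pickling the whole object. The sketch below uses `h5py` for brevity rather than the PyTables API mentioned in the comment; the idea is the same, and `scipy.sparse.save_npz` is an even simpler alternative in current SciPy.

import h5py
from scipy.sparse import csr_matrix

def save_csr(path, matrix):
    # Store the three CSR component arrays plus the shape; this avoids
    # pickling one huge object.
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=matrix.data)
        f.create_dataset("indices", data=matrix.indices)
        f.create_dataset("indptr", data=matrix.indptr)
        f.attrs["shape"] = matrix.shape

def load_csr(path):
    with h5py.File(path, "r") as f:
        return csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]),
            shape=tuple(f.attrs["shape"]),
        )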
5

I solved this problem using sklearn and pandas.

Iterate over your dataset once using the pandas iterator and build a set of all words; after that, use it as the CountVectorizer vocabulary. With a fixed vocabulary, CountVectorizer generates sparse matrices that all have the same shape (the same columns), so you can simply use vstack to stack them. The resulting sparse matrix holds the same information (though with the words in a different column order) as a CountVectorizer fitted on all your data at once.

This solution is not the best in terms of time complexity, but it is good in terms of memory usage. I used it on a dataset of 20 GB+.

I wrote some Python code (NOT THE COMPLETE SOLUTION) that shows the idea; write a generator or use pandas chunks to iterate over your dataset.

from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack


# each string is a sample
text_test = [
    'good people beauty wrong',
    'wrong smile people wrong',
    'idea beauty good good',
]

# scikit-learn basic usage

vectorizer = CountVectorizer()

result1 = vectorizer.fit_transform(text_test)
print(vectorizer.inverse_transform(result1))
print(f"First approach:\n {result1}")

# Another solution: build the vocabulary in a first pass, then vectorize in batches

vocabulary = set()

for text in text_test:
    for word in text.split():
        vocabulary.add(word)

vectorizer = CountVectorizer(vocabulary=vocabulary)

outputs = []
for text in text_test:  # in practice, use a generator over your real dataset
    # the vocabulary is fixed, so transform() is enough and every batch has the same width
    outputs.append(vectorizer.transform([text]))


result2 = vstack(outputs)
print(vectorizer.inverse_transform(result2))

print(f"Second approach:\n {result2}")

Finally, use TfidfTransformer.
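For completeness, a minimal sketch of that last step applied to `result2` from the snippet above; the IDF weights are global, so they are computed once over the full stacked count matrix.

from sklearn.feature_extraction.text import TfidfTransformer

# Turn the stacked raw counts into TF-IDF weights.
tfidf = TfidfTransformer()
result_tfidf = tfidf.fit_transform(result2)
print(f"TF-IDF matrix:\n {result_tfidf}")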

Felipe Borges
-2

  • The lengths of the documents
  • The number of terms in common
  • Whether the terms are common or unusual
  • How many times each term appears