TFIDFVectorizer takes so much memory ,vectorizing 470 MB of 100k documents takes over 6 GB , if we go 21 million documents it will not fit 60 GB of RAM we have.
So we go for HashingVectorizer but still need to know how to distribute the hashing vectorizer.Fit and partial fit does nothing so how to work with Huge Corpus?