I am relatively new to NLP and sentiment analysis, but I am enrolled in a machine learning class and am building a sentiment analysis system that will read a financial article and determine whether the overall sentiment is good or bad.
Currently, I have a dataset of about 2000 articles. I know that I need to use TF-IDF vectorization to cast all the instances in the dataset into the same vector space. I also know that TF-IDF requires a "Vocabulary", and that the size of this "Vocabulary" is the length of each vector, with every vector representing one article.
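To check that I understand the vector-length point, here is a minimal sketch with scikit-learn's `TfidfVectorizer` (the three toy sentences are just placeholders for my articles):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# tiny placeholder corpus standing in for my ~2000 articles
articles = [
    "Shares rallied after the company beat earnings estimates.",
    "The firm missed revenue targets and cut its full-year guidance.",
    "Analysts remain neutral on the stock ahead of the Fed meeting.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(articles)

# each row is one article; the number of columns equals the vocabulary size
print(X.shape)
print(len(vectorizer.vocabulary_))  # same value as X.shape[1]
```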
My question is: how do I determine this vocabulary? One method I have found is to apply pre-processing (removing stop words, noisy words, punctuation, etc.) and then use ALL words in EVERY article in the training set. From there you can remove the words that occur very rarely (unimportant words) and the words that occur far too often (non-distinguishing words). However, in my opinion the "Vocabulary" is still going to be quite large, and hence the vectors are going to be very long.
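If I go that route, I assume it would look something like the sketch below, with the rare-word and common-word cut-offs expressed as `min_df` / `max_df`. The thresholds here are just guesses to make the toy example work; on the real 2000-article set I would presumably raise `min_df`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# small placeholder corpus; the real input would be my 2000 raw article strings
articles = [
    "stocks rose on strong earnings",
    "stocks fell on weak earnings",
    "the company reported strong revenue growth",
    "revenue growth slowed at the company",
    "markets rallied as stocks climbed",
    "markets dropped as stocks slid",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop common English stop words
    min_df=2,              # ignore words appearing in fewer than 2 articles (unimportant words)
    max_df=0.8,            # ignore words appearing in more than 80% of articles (non-distinguishing words)
)
X = vectorizer.fit_transform(articles)

print(X.shape)                         # the pruned vocabulary sets the vector length
print(sorted(vectorizer.vocabulary_))  # which words survived the cut-offs
```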
Overall, this approach seems logical but processing-heavy. I feel that a "Vocabulary" built from all words in every article is going to be HUGE, and then iterating through every article to count how many times each vocabulary word occurs is going to require a lot of processing power. If I am using NLTK and scikit-learn, do I have anything to worry about? If so, is there a better way to create the vocabulary?
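For reference, the full pipeline I have in mind looks roughly like this; the NLTK stop-word list and the `max_features` cap are just my guesses at sensible choices, not something I have settled on:

```python
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

# placeholder; in practice this would be my list of ~2000 raw article strings
articles = ["first article text here", "second article text here"]

vectorizer = TfidfVectorizer(
    stop_words=stopwords.words("english"),  # NLTK's English stop-word list
    max_features=5000,  # keep only the 5000 most frequent terms, capping the vocabulary size
)
X = vectorizer.fit_transform(articles)

# scikit-learn returns a sparse matrix, so the large vocabulary is not stored
# as a dense 2000 x |V| array; only the non-zero entries take up memory
print(type(X), X.shape, X.nnz)
```

My understanding is that `max_features` is one way to keep the vector size bounded, but I am not sure whether that, or the `min_df`/`max_df` pruning above, is the recommended way to build the vocabulary.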