2

I am relatively new to NLP and sentiment analysis, but I am enrolled in a machine learning class and am building a sentiment analysis model that will read a financial article and determine whether the overall sentiment is positive or negative.

Currently, I have a dataset of about 2000 articles. I know that I need to use TF-IDF vectorization to cast all the instances in the dataset into the same vector space. I also know that TF-IDF requires a "vocabulary", that the size of this vocabulary is the length of the vector, and that each vector represents one article.

My question is: how do I determine this vocabulary? One method I have found is to apply pre-processing (remove stop words, noisy words, punctuation, etc.) and then use ALL words in EVERY article in the training set. From there you can remove the words that occur very rarely (unimportant words) and the words that occur far too often (non-distinguishing words). However, in my opinion, the vocabulary is still going to be quite large, and hence the vector size is going to be very large.

Overall, this approach seems logical but processing-heavy. I feel that the initial vocabulary containing all words in every article is going to be HUGE, and iterating through every article to count how many times each vocabulary word occurs is going to require a lot of processing power. If I am using NLTK and scikit-learn, do I have anything to worry about? If so, is there a better way to create the vocabulary?
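
For reference, here is a rough sketch of what I think this would look like with scikit-learn. I'm assuming min_df / max_df are the right knobs for the frequency cut-offs I described, and the tiny article list is just a placeholder for my real dataset:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Stand-ins for my ~2000 raw article strings.
    articles = [
        "Shares of ACME surged after a strong earnings report.",
        "ACME shares fell sharply on weak earnings guidance.",
        "Investors cheered the central bank's decision on rates.",
        "Markets slid as investors worried about rate decisions.",
    ]

    # stop_words drops common English words; min_df drops words appearing in
    # fewer than 2 articles (too rare); max_df drops words appearing in more
    # than 80% of articles (non-distinguishing). With 2000 real articles I
    # would raise min_df.
    vectorizer = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.8)
    X = vectorizer.fit_transform(articles)   # sparse matrix, one row per article

    print(sorted(vectorizer.vocabulary_))    # the learned vocabulary
    print(X.shape)                           # (n_articles, vocabulary_size)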

  • Your data set seems small. You should be fine. There are much more clever ways of counting than going through every word in a vocabulary per document and counting each word sequentially. You only need a single pass over each document... – juanpa.arrivillaga Apr 17 '18 at 08:06

2 Answers

0

First of all, I don't think you have anything to worry about. These libraries were made to handle corpora of this size (and actually far larger). Some methods read every page of the English Wikipedia, so 2000 articles should be easy enough.

There are also methods that create smaller, more efficient representations of every word. You could check out "word2vec", for example, which is a very important technique in NLP. I'd even suggest using it in your case, since it tends to give better results in tasks such as sentiment analysis (although if the course is specifically teaching TF-IDF, then I obviously withdraw the suggestion).
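
A minimal sketch of that direction, assuming gensim 4.x rather than NLTK/scikit-learn (the tokenized snippets and parameter values are only placeholders): train word2vec on the tokenized articles, then average the word vectors to get one fixed-size vector per article.

    import numpy as np
    from gensim.models import Word2Vec

    # Tokenized stand-ins for the real training articles.
    tokenized = [
        ["shares", "surged", "after", "strong", "earnings"],
        ["shares", "fell", "on", "weak", "earnings", "guidance"],
    ]

    # Tiny vectors for illustration; 100-300 dimensions is more typical.
    model = Word2Vec(sentences=tokenized, vector_size=50, window=5, min_count=1)

    # One common way to get a fixed-size article vector: average its word vectors.
    def article_vector(tokens, model):
        vectors = [model.wv[w] for w in tokens if w in model.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

    print(article_vector(tokenized[0], model).shape)   # (50,)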

If your vocabulary is too large for you, you could also try a different stemmer (the component that reduces words to their root forms in the preprocessing stage). While the most commonly used stemmer is "Snowball", "Lancaster" is more aggressive (and so will collapse more word forms together, leaving fewer distinct words). You can read about it here: What are the major differences and benefits of Porter and Lancaster Stemming algorithms?
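
A quick way to compare stemmers with NLTK (a rough sketch; the example words are arbitrary):

    from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

    words = ["maximum", "presumably", "crying", "financial", "valuation"]

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    lancaster = LancasterStemmer()

    # Lancaster typically produces shorter, more aggressive stems, so more
    # surface forms collapse together and the resulting vocabulary shrinks.
    for w in words:
        print(w, porter.stem(w), snowball.stem(w), lancaster.stem(w))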

Enjoy getting to know NLP, it's an amazing subject :)

TsurG
0

A basic approach to sentiment analysis involves building a vocabulary from the training corpus and using it to make feature vectors for your data. A vocabulary of a few hundred thousand words is quite common and nothing to worry about. The main challenge in this approach is actually the opposite of what you are thinking: you should find ways of increasing the size of your vocabulary rather than decreasing it.

You can also try to enrich the vocabulary by using sentiment lexicons like SentiWordNet.
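
SentiWordNet is available through NLTK's corpus readers; a minimal sketch (the word "profit" is just an example, and the downloads are one-time):

    import nltk
    from nltk.corpus import sentiwordnet as swn

    # One-time downloads of the lexicons NLTK needs here.
    nltk.download("sentiwordnet")
    nltk.download("wordnet")

    # Each synset carries positive, negative and objective scores.
    for synset in swn.senti_synsets("profit"):
        print(synset, synset.pos_score(), synset.neg_score(), synset.obj_score())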

As far as the implementation of your approach is concerned, you can build a scikit-learn pipeline involving CountVectorizer to build the vocabulary and the count features. One advantage of using CountVectorizer for building the vocabulary is that it produces a sparse matrix, which addresses your concern about large vector sizes. Then add a TfidfTransformer to turn the raw counts into TF-IDF weights (term frequencies scaled by inverse document frequencies), and finally a model for training.
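
A minimal sketch of such a pipeline (the four toy articles, their labels, and the Naive Bayes step are only placeholders):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    # Stand-ins for the real training articles and their sentiment labels (1 = positive).
    articles = [
        "Shares surged on strong earnings.",
        "The stock plunged after a profit warning.",
        "Analysts cheered the upbeat guidance.",
        "Investors dumped the stock on weak results.",
    ]
    labels = [1, 0, 1, 0]

    pipeline = Pipeline([
        ("counts", CountVectorizer()),   # builds the vocabulary, sparse count matrix
        ("tfidf", TfidfTransformer()),   # reweights the counts by TF-IDF
        ("clf", MultinomialNB()),        # the model; any classifier can go here
    ])

    pipeline.fit(articles, labels)
    print(pipeline.predict(["Earnings beat expectations and shares rallied."]))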

Consider adding some more features to your vector beyond the pure bag of words. Be sure to perform a grid search over your model and preprocessing stages to fine-tune the parameters for the best accuracy. I recently did a similar project on sentiment analysis of StockTwits data. I used a Naive Bayes classifier and got an accuracy of 72%; Naive Bayes proved to be better than even some deep learning models like RNN/DNN classifiers. Model selection, though independent of your question, is an integral part of building your project, so keep tweaking it until you get good results. Check out my project if you want some insights into my implementation.
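
Continuing the sketch above (it reuses pipeline, articles, and labels), a grid search over both the preprocessing and the model parameters could look roughly like this; the grid values are only examples, not recommendations:

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "counts__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
        "tfidf__sublinear_tf": [False, True],      # raw vs. log-scaled term frequency
        "clf__alpha": [0.1, 1.0],                  # Naive Bayes smoothing
    }

    # cv=2 only because the toy data above has four articles; use more folds on real data.
    search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
    search.fit(articles, labels)
    print(search.best_params_, search.best_score_)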

Be mindful of the following points while doing your project:

  • Some researchers believe that stop words actually add meaning to sentiment, so I would recommend not removing them during the preprocessing phase. See this paper
  • Always use domain knowledge while doing sentiment analysis. A negative sentiment in one domain, like "predictable movie", can be positive in another, like "predictable share market".
  • Don't remove words from the vocabulary on your own (on the basis of frequency, as you mentioned in the question). The TF-IDF weighting is meant for exactly this purpose; see the short illustration after this list.
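
To illustrate the last point, TF-IDF already gives the lowest weight to words that appear in every document, so manual frequency-based pruning is largely redundant (toy example):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the market rallied today",
        "the market crashed today",
        "the investors cheered the rally",
    ]

    vec = TfidfVectorizer()
    vec.fit(docs)

    # "the" occurs in every document, so it receives the lowest idf weight;
    # rarer, more distinguishing words receive higher weights.
    for word, idx in sorted(vec.vocabulary_.items()):
        print(f"{word:10s} idf={vec.idf_[idx]:.2f}")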

The field of sentiment analysis is filled with a great deal of research and exciting new techniques. I would recommend reading some papers, like this one by pioneers in the field.

penguin2048