Introduction:
One of the features used by many sentiment analysis programs is computed by assigning a score from a lexicon to each relevant unigram, bigram, or pair found in the text. In more detail:
An example lexicon could be:
//unigrams
good 1
bad -1
great 2
//bigrams
good idea 1
bad idea -1
//pairs (--- stands for any intervening text)
hold---up -0.62
how---i still -0.62
Given a sample text T, for each unigram, bigram, or pair in T I want to check whether a matching entry is present in the lexicon.
The unigram/bigram part is easy: I load the lexicon into a Map and then iterate over my text, checking whether each word is present in the dictionary. My problem is with detecting pairs.
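For reference, the unigram/bigram lookup could be sketched roughly as below. This is only a minimal illustration: the class name LexiconScorer, the method score, whitespace tokenization, lowercasing, and summing the matched scores are all assumptions made for the example, not requirements of the lexicon.

```java
import java.util.HashMap;
import java.util.Map;

class LexiconScorer {
    // Hypothetical helper: sums the scores of every unigram and bigram
    // of the text that has an entry in the lexicon (summing is an assumption).
    static double score(String text, Map<String, Double> lexicon) {
        // Assumption: tokens are separated by whitespace and matched case-insensitively.
        String[] words = text.toLowerCase().trim().split("\\s+");
        double total = 0.0;
        for (int i = 0; i < words.length; i++) {
            // Unigram lookup: one hash lookup per word.
            total += lexicon.getOrDefault(words[i], 0.0);
            // Bigram lookup: join the current word with the next one.
            if (i + 1 < words.length) {
                total += lexicon.getOrDefault(words[i] + " " + words[i + 1], 0.0);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> lexicon = new HashMap<>();
        lexicon.put("good", 1.0);
        lexicon.put("bad", -1.0);
        lexicon.put("great", 2.0);
        lexicon.put("good idea", 1.0);
        lexicon.put("bad idea", -1.0);
        System.out.println(score("a good idea", lexicon));
    }
}
```

Each word costs at most two hash lookups, so the work grows with the length of the text and not with the size of the lexicon.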
My Problem:
One way to check whether specific pairs are present in my text would be to iterate over the whole lexicon of pairs and run a regex on the text, checking for each pair in the lexicon whether "start_of_pair.*end_of_pair" matches somewhere in the text. This seems very wasteful, because I'd have to iterate over the WHOLE lexicon for each text I analyze. Any ideas on how to do this in a smarter way?
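To make the cost concrete, the brute-force regex scan described above could look roughly like this. It is only a sketch under assumptions: the class name PairRegexScan and the method scorePairs are made up for the example, the added word boundaries are not part of the pattern quoted above, and each matching pair is counted once.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PairRegexScan {
    // Hypothetical helper: tries EVERY pair entry against the text,
    // which is exactly the wasteful part.
    static double scorePairs(String text, Map<String, Double> pairLexicon) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : pairLexicon.entrySet()) {
            // Keys are assumed to look like "start---end", as in the example lexicon.
            String[] parts = entry.getKey().split("---", 2);
            if (parts.length < 2) {
                continue; // not a pair entry, skip it
            }
            // Assumption: \b word boundaries are added so that "up" does not match inside "upper".
            Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(parts[0]) + "\\b.*\\b" + Pattern.quote(parts[1]) + "\\b");
            if (pattern.matcher(text).find()) {
                total += entry.getValue(); // assumption: count each pair once per text
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> pairs = new LinkedHashMap<>();
        pairs.put("hold---up", -0.62);
        pairs.put("how---i still", -0.62);
        System.out.println(scorePairs("hold me up", pairs));
    }
}
```

Every call compiles and runs one regex per lexicon entry, so the cost per text grows with the size of the pair lexicon regardless of how short the text is.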
Related questions: Most Efficient Way to Check File for List of Words and Java: Most efficient way to check if a String is in a wordlist