Most efficient way to check if a pair is present in a text

Question

Introduction:

One of the features used by many sentiment analysis programs is calculated by assigning to relevant unigrams, bigrams or pairs a specific score according to a lexicon. More in detail:

An example lexicon could be:

//unigrams
good 1
bad -1
great 2
//bigrams
good idea 1
bad idea -1
//pairs (--- stands for whatever):
hold---up   -0.62
how---i still -0.62

Given a sample text T, for each each unigram, bigram or pair in T i want to check if a correspondence is present in the lexicon.

The unigram\bigram part is easy: i load the lexicon in a Map and then iterate my text, checking each word if present in the dictionary. My problems is with detecting pairs.

My Problem:

One way to check if specific pairs are present in my text would be to iterate the whole lexicon of pairs and use a regex on the text. Checking for each word in the lexicon if "start_of_pair.*end_of_pair" is present in the text. This seems very wasteful, because i'd have to iterate the WHOLE lexicon for each text to analyze. Any ideas on how to do this in a smarter way?

Related questions: Most Efficient Way to Check File for List of Words and Java: Most efficient way to check if a String is in a wordlist

As a simple first pass, you could look for the fist term of the pair, and if you find it, look for the second term in the remainder of the text. — yshavit, Jan 20 '14 at 18:20

score 0 · Answer 1 · answered Jan 20 '14 at 19:25

One could realize a frequency map of bigrams as:

Map<String, Map<String, Integer> bigramFrequencyMap = new TreeMap<>();

Fill the map with the desired bigrams with initial frequency 0. First lexeme, second lexeme, to frequency count.

static final int MAX_DISTANCE = 5;

Then a lexical scan would keep the last #MAX_DISTANCE lexemes.

List<Map<String, Integer>> lastLexemesSecondFrequencies = new ArrayList<>();

void processLexeme() {
     String lexeme = readLexeme();

     // Check whether there is a bigram:
     for (Map<String, Integer> prior : lastLexemesSecondFrequencies) {
          Integer freq = prior.get(lexeme);
          if (freq != null) {
              prior.put(lexeme, 1 + freq);
          }
     }

     Map<String, Integer> lexemeSecondFrequencies =
             bigramFrequencyMap.get(lexeme);
     if (lexemeSecondFrequencies != null) {
         // Could remove lexemeSecondFrequencies if present in lastLexemes.
         lastLexems.add(0, lexemeSecondFrequencies); // addFirst
         if (lastLexemes.size() > MAX_DISTANCE) {
             lastLexemes.remove(lastLexemes.size() - 1); // removeLast
         }
     }
}

The optimization is to keep the bigrams second half, and only handle registered bigrams.

score 0 · Answer 2 · answered Jan 22 '14 at 19:07

At the end i ended up solving it this way: I loaded the pair lexicon as Map<String, Map<String, Float>> - where the first key is the first half of the pairs, the inner map holds all the possible ending for that key's start and the corresponding sentiment value.

Basically i have a List of possible endings (enabledTokens) which i increase each time i read a new token - and then i search this list to see if the current token is the ending of some previous pair.

With a few modifications to prevent the previous token from being used right away for an ending, this is my code:

private Map<String, Map<String, Float>> firstPartMap;
private List<LexiconPair> enabledTokensForUnigrams, enabledTokensForBigrams;
private Queue<List<LexiconPair>> pairsForBigrams; //is initialized with two empty lists
private Token oldToken;

public void parseToken(Token token) {
    String unigram = token.getText();
    String bigram = null;
    if (oldToken != null) {
        bigram = oldToken.getText() + " " + token.getText();
    }

    checkIfPairMatchesAndUpdateFeatures(unigram, enabledTokensForUnigrams);
    checkIfPairMatchesAndUpdateFeatures(bigram, enabledTokensForBigrams);

    List<LexiconPair> pairEndings = toPairs(firstPartMap.get(unigram));
    if(bigram!=null)pairEndings.addAll(toPairs(firstPartMap.get(bigram)));
    pairsForBigrams.add(pairEndings);

    enabledTokensForUnigrams.addAll(pairEndings);
    enabledTokensForBigrams.addAll(pairsForBigrams.poll());

    oldToken = token;
}
private void checkIfPairMatchesAndUpdateFeatures(String text, List<LexiconPair> listToCheck) {
    Iterator<LexiconPair> iter = listToCheck.iterator();
    while (iter.hasNext()) {
        LexiconPair next = iter.next();
        if (next.getText().equals(text)) {
            float val = next.getValue();
            POLARITY polarity = getPolarity(val);
            for (LexiconFeatureSubset lfs : lexiconsFeatures) {
                lfs.handleNewValue(Math.abs(val), polarity);
            }
            //iter.remove();
            //return; //remove only 1 occurrence
        }
    }
}

Most efficient way to check if a pair is present in a text

Introduction:

An example lexicon could be:

My Problem:

2 Answers2