
I've written a small C# program that assembles a bunch of words into a line of text, and I want to use NLP only to give me a percentage probability that the word sequence is a sentence. I don't need tokens or tagging; all of that can happen in the background if it needs to be done. I have OpenNLP and SharpEntropy referenced in my project, but I get the error "Array dimensions exceeded supported range." when using them. I've also tried an IKVM-converted OpenNLP without SharpEntropy, but without documentation I can't wrap my head around the proper steps to get only the percentage probability.

Any help or direction would be appreciated.

user3530692

1 Answer


I'll recommend two relatively simple measures that might help you classify a word sequence as sentence/non-sentence. Unfortunately, I don't know how well SharpNLP will handle either; more complete toolkits exist in Java, Python, and C++ (LingPipe, Stanford CoreNLP, GATE, NLTK, OpenGRM, ...).

Language-model probability: Train a language model on sentences with start and stop tokens at the beginning/end of each sentence. Compute the probability of your target sequence under that language model. Grammatical and/or semantically sensible word sequences will score much higher than random word sequences. This approach should work with a standard n-gram model, a discriminative conditional probability model, or pretty much any other language-modeling approach. But definitely start with a basic n-gram model.
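To make the idea concrete, here's a hand-rolled sketch in Python (not SharpNLP or OpenNLP code; the toy corpus, smoothing choice, and function names are all illustrative assumptions): a bigram model with `<s>`/`</s>` start/stop tokens, add-one smoothing, and length-normalized log probability.

```python
# Sketch of a bigram language model with start/stop tokens.
# Toy corpus and add-one (Laplace) smoothing are illustrative choices.
import math
from collections import Counter

START, STOP = "<s>", "</s>"

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    vocab = {START, STOP}
    for sent in sentences:
        toks = [START] + sent.lower().split() + [STOP]
        vocab.update(toks)
        unigrams.update(toks[:-1])            # contexts
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams, len(vocab)

def avg_log_prob(sequence, unigrams, bigrams, vsize):
    toks = [START] + sequence.lower().split() + [STOP]
    lp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        # add-one smoothing so unseen bigrams get nonzero probability
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vsize))
    # normalize by length so longer sequences aren't penalized
    return lp / (len(toks) - 1)

corpus = ["the dog barks", "the cat sleeps", "a dog sleeps"]
uni, bi, v = train_bigram(corpus)
sentence_like = avg_log_prob("the dog sleeps", uni, bi, v)
random_words = avg_log_prob("sleeps a the", uni, bi, v)
# The grammatical sequence scores higher (less negative) than the shuffle.
```

In practice you would train on a large corpus and map the normalized log probability to your percentage score (e.g. by thresholding or a logistic fit); the same structure carries over directly to C#.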

Parse tree probability: Similarly, you can measure the inside probability of recovered constituency structure (e.g. via a probabilistic context free grammar parse). More grammatical sequences (i.e., more likely to be a complete sentence) will be reflected in higher inside probabilities. You will probably get better results if you normalize by the sequence length (the same may apply to a language-modeling approach as well).
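As a minimal illustration of the inside-probability idea (again not toolkit code; the grammar, rule probabilities, and helper names are invented for this example), here is CKY-style inside probability under a toy PCFG in Chomsky normal form, normalized by sequence length:

```python
# Sketch: inside probability of a word sequence under a toy PCFG
# (Chomsky normal form), computed bottom-up with the CKY recurrence.
import math
from collections import defaultdict

# Binary rules (lhs -> B C) and lexical rules (lhs -> word), with probabilities.
binary = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 0.5,
}
lexical = {
    ("VP", "sleeps"): 0.5,
    ("Det", "the"): 0.6, ("Det", "a"): 0.4,
    ("N", "dog"): 0.5, ("N", "cat"): 0.5,
    ("V", "sees"): 1.0,
}

def inside_prob(words, start="S"):
    n = len(words)
    # chart[i][j][A] = total inside probability of A spanning words[i:j]
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (lhs, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for (lhs, (B, C)), p in binary.items():
                    chart[i][j][lhs] += p * chart[i][k][B] * chart[k][j][C]
    return chart[0][n][start]

def normalized_log_prob(sentence):
    words = sentence.split()
    p = inside_prob(words)
    return math.log(p) / len(words) if p > 0 else float("-inf")

# "the dog sleeps" parses (S -> NP VP, NP -> Det N, VP -> sleeps),
# so its inside probability is 1.0 * (1.0 * 0.6 * 0.5) * 0.5 = 0.15;
# an ungrammatical shuffle like "sleeps the dog" gets probability 0.
```

With a real treebank-derived grammar, ungrammatical sequences get small-but-nonzero probabilities rather than hard zeros, and the length-normalized score becomes a usable grammaticality signal.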

I've seen preliminary (but unpublished) results on tweets that seem to indicate a bimodal distribution of normalized probabilities: tweets that were judged more grammatical by human annotators often fell within a higher peak, and those judged less grammatical clustered into a lower one. But I don't know how well those results would hold up in a larger or more formal study.

AaronD