The paper was useful, but it goes into more depth than this problem needs. Here is the author's basic approach, summarized heuristically:
- Baseline sentence heuristic: the first letter is capitalized and the line ends with one of .?! (1 feature).
- Number of characters, words, punctuation marks, digits, and named entities (from the Stanford CoreNLP NER tagger), plus versions normalized by text length (10 features).
- Part-of-speech distribution: (# occurrences of tag / # words) for each Penn Treebank tag (45 features); a short sketch of these POS features follows the list.
- Indicators for the part-of-speech tag of the first and last token in the text (45 x 2 = 90 features).
- Language model raw score (s_lm = log p(text)) and normalized score (s̄_lm = s_lm / # words) (2 features).
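The POS-based features are not computed by the function below (it only covers the counts, the sentence pattern, and the LM normalization). Here is a rough sketch of how you could compute them with spaCy, which I also use later instead of CoreNLP; the helper name pos_features and the feature-naming scheme are my own illustration, not the author's code, and the language-model score itself would still need an external language model:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def pos_features(text):
    """Sketch of the POS-distribution and first/last-tag indicator features."""
    doc = nlp(text)
    tags = [t.tag_ for t in doc]  # Penn Treebank tags, e.g. 'DT', 'VBZ', '.'
    feats = {}
    # Distribution: count of each tag divided by the number of tokens.
    for tag, n in Counter(tags).items():
        feats["f_pos_" + tag] = n / len(tags)
    # Indicators for the tag of the first and last token.
    if tags:
        feats["f_first_" + tags[0]] = 1
        feats["f_last_" + tags[-1]] = 1
    return feats

Calling pos_features("This is a well-formed sentence.") returns a dict with one f_pos_* ratio per tag that occurs, plus the two indicator features; in practice you would expand these into the full 45 (and 90) columns expected by the paper.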
However, after a lot of searching, it turns out the GitHub repo only includes the tests and visualizations; the raw training and test data are not there. Here is his function for calculating these features (note: it operates on a pandas DataFrame, df):
import re

def make_basic_features(df):
    """Compute basic features."""
    df['f_nchars'] = df['__TEXT__'].map(len)
    df['f_nwords'] = df['word'].map(len)

    punct_counter = lambda s: sum(1 for c in s
                                  if (not c.isalnum())
                                  and c not in [" ", "\t"])
    df['f_npunct'] = df['__TEXT__'].map(punct_counter)
    df['f_rpunct'] = df['f_npunct'] / df['f_nchars']

    df['f_ndigit'] = df['__TEXT__'].map(lambda s: sum(1 for c in s
                                                      if c.isdigit()))
    df['f_rdigit'] = df['f_ndigit'] / df['f_nchars']

    upper_counter = lambda s: sum(1 for c in s if c.isupper())
    df['f_nupper'] = df['__TEXT__'].map(upper_counter)
    df['f_rupper'] = df['f_nupper'] / df['f_nchars']

    # Named-entity count: 'O' is the tag for "not part of an entity".
    df['f_nner'] = df['ner'].map(lambda ts: sum(1 for t in ts
                                                if t != 'O'))
    df['f_rner'] = df['f_nner'] / df['f_nwords']

    # Check the standard sentence pattern:
    # starts with a capital letter, ends with .?!
    def check_sentence_pattern(s):
        ss = s.strip(r"""`"'""").strip()
        return ss[0].isupper() and (ss[-1] in '.?!')
    df['f_sentence_pattern'] = df['__TEXT__'].map(check_sentence_pattern)

    # Normalize any language-model features
    # by dividing the log score by the number of words.
    lm_cols = {c: re.sub("_lmscore_", "_lmscore_norm_", c)
               for c in df.columns if c.startswith("f_lmscore")}
    for c, cnew in lm_cols.items():
        df[cnew] = df[c] / df['f_nwords']
    return df
So that's a function you can use in this case. For a minimal version:

import pandas as pd

raw = ["This is a well-formed sentence.",
       "but this ain't a good sent",
       "just a fragment"]
df = pd.DataFrame([{"__TEXT__": s, "word": s.split(), "ner": []} for s in raw])
The function expects, for each text, a list of its words and a list of named-entity (NER) tags, which the author produced with the Stanford CoreNLP library (written in Java). You can pass an empty list ([]) for the NER tags and the function will still calculate everything else. You'll get back a DataFrame (think of it as a matrix) with all the features for each sentence, which you can then use to decide what to call "well formed" by the rules given.
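For example, here is one minimal way to turn those features into a clean list and a dirty list. The threshold rule is my own illustration; the author actually trains a classifier on these features rather than hand-picking cutoffs:

feats = make_basic_features(df)

# Illustrative rule: keep texts that match the baseline sentence pattern
# and are not dominated by punctuation or digits. The 0.2 thresholds
# are arbitrary, not from the paper.
keep = (feats['f_sentence_pattern']
        & (feats['f_rpunct'] < 0.2)
        & (feats['f_rdigit'] < 0.2))

clean = feats.loc[keep, '__TEXT__'].tolist()
dirty = feats.loc[~keep, '__TEXT__'].tolist()
print(clean)
print(dirty)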
Also, you don't HAVE to start from pandas objects: a plain list of dictionaries works as the input, as shown above, since pd.DataFrame converts it. The function itself operates on the resulting DataFrame, which is how the original code was written.
Because this example involves a lot of steps, I've created a gist where I run through an example up to the point of producing a clean list of sentences and a dirty list of not-well-formed sentences:
my gist: https://gist.github.com/marcmaxson/4ccca7bacc72eb6bb6479caf4081cefb
It replaces the Stanford CoreNLP Java library with spaCy, a newer and easier-to-use Python library that fills in the missing metadata, such as sentiment, named entities, and parts of speech, used to determine whether a sentence is well-formed. It runs under Python 3.6 but should also work under 2.7, since the libraries involved are backwards compatible.
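If you don't want to work through the whole gist, here is a rough sketch of how you might fill the word and ner columns with spaCy instead of passing empty lists. The 'O' convention mirrors CoreNLP's "not an entity" tag; the exact column construction here is my assumption, not copied from the gist:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def to_row(text):
    doc = nlp(text)
    return {
        "__TEXT__": text,
        "word": [t.text for t in doc],
        # Mimic CoreNLP-style tags: 'O' for tokens outside any entity.
        "ner": [t.ent_type_ if t.ent_type_ else "O" for t in doc],
    }

raw = ["This is a well-formed sentence.",
       "but this ain't a good sent",
       "just a fragment"]
df = pd.DataFrame([to_row(s) for s in raw])
feats = make_basic_features(df)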