The paper was useful, but it goes into more depth than this problem needs. Here is the author's basic approach, summarized heuristically:
- Baseline sentence heuristic: the first letter is capitalized and the line ends with one of .?! (1 feature).
- Number of characters, words, punctuation marks, digits, and named entities (from the Stanford CoreNLP NER tagger), plus versions normalized by text length (10 features).
- Part-of-speech distribution: (# occurrences of tag / # words) for each Penn Treebank tag (45 features); a short sketch of these POS features follows the list.
- Indicators for the part-of-speech tag of the first and last token in the text (45 x 2 = 90 features).
- Language model raw score (s_lm = log p(text)) and normalized score (s̄_lm = s_lm / # words) (2 features).
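The POS-based features are not computed by the function below (it only covers the counts, the sentence pattern, and the LM normalization). Here is a rough sketch of how you could compute them with spaCy, which I also use later instead of CoreNLP; the helper name pos_features and the feature-naming scheme are my own illustration, not the author's code, and the language-model score itself would still need an external language model:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

def pos_features(text):
    """Sketch of the POS-distribution and first/last-tag indicator features."""
    doc = nlp(text)
    tags = [t.tag_ for t in doc]  # Penn Treebank tags, e.g. 'DT', 'VBZ', '.'
    feats = {}
    # Distribution: count of each tag divided by the number of tokens.
    for tag, n in Counter(tags).items():
        feats["f_pos_" + tag] = n / len(tags)
    # Indicators for the tag of the first and last token.
    if tags:
        feats["f_first_" + tags[0]] = 1
        feats["f_last_" + tags[-1]] = 1
    return feats

Calling pos_features("This is a well-formed sentence.") returns a dict with one f_pos_* ratio per tag that occurs, plus the two indicator features; in practice you would expand these into the full 45 (and 90) columns expected by the paper.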
However, after a lot of searching, it turns out the GitHub repo only includes the tests and visualizations; the raw training and test data are not there. Here is his function for calculating these features (note: it operates on a pandas DataFrame, df):
import re

def make_basic_features(df):
    """Compute basic features."""
    df['f_nchars'] = df['__TEXT__'].map(len)
    df['f_nwords'] = df['word'].map(len)

    punct_counter = lambda s: sum(1 for c in s
                                  if (not c.isalnum())
                                  and c not in [" ", "\t"])
    df['f_npunct'] = df['__TEXT__'].map(punct_counter)
    df['f_rpunct'] = df['f_npunct'] / df['f_nchars']

    df['f_ndigit'] = df['__TEXT__'].map(lambda s: sum(1 for c in s
                                                      if c.isdigit()))
    df['f_rdigit'] = df['f_ndigit'] / df['f_nchars']

    upper_counter = lambda s: sum(1 for c in s if c.isupper())
    df['f_nupper'] = df['__TEXT__'].map(upper_counter)
    df['f_rupper'] = df['f_nupper'] / df['f_nchars']

    # Named-entity count: 'O' is the tag for "not part of an entity".
    df['f_nner'] = df['ner'].map(lambda ts: sum(1 for t in ts
                                                if t != 'O'))
    df['f_rner'] = df['f_nner'] / df['f_nwords']

    # Check the standard sentence pattern:
    # starts with a capital letter, ends with .?!
    def check_sentence_pattern(s):
        ss = s.strip(r"""`"'""").strip()
        return ss[0].isupper() and (ss[-1] in '.?!')
    df['f_sentence_pattern'] = df['__TEXT__'].map(check_sentence_pattern)

    # Normalize any language-model features
    # by dividing the log score by the number of words.
    lm_cols = {c: re.sub("_lmscore_", "_lmscore_norm_", c)
               for c in df.columns if c.startswith("f_lmscore")}
    for c, cnew in lm_cols.items():
        df[cnew] = df[c] / df['f_nwords']
    return df
So that's a function you can use in this case. For a minimal version:

import pandas as pd

raw = ["This is a well-formed sentence.",
       "but this ain't a good sent",
       "just a fragment"]
df = pd.DataFrame([{"__TEXT__": s, "word": s.split(), "ner": []} for s in raw])
The function expects, for each text, a list of its words and a list of named-entity (NER) tags, which the author produced with the Stanford CoreNLP library (written in Java). You can pass an empty list ([]) for the NER tags and the function will still calculate everything else. You'll get back a DataFrame (think of it as a matrix) with all the features for each sentence, which you can then use to decide what to call "well formed" by the rules given.
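For example, here is one minimal way to turn those features into a clean list and a dirty list. The threshold rule is my own illustration; the author actually trains a classifier on these features rather than hand-picking cutoffs:

feats = make_basic_features(df)

# Illustrative rule: keep texts that match the baseline sentence pattern
# and are not dominated by punctuation or digits. The 0.2 thresholds
# are arbitrary, not from the paper.
keep = (feats['f_sentence_pattern']
        & (feats['f_rpunct'] < 0.2)
        & (feats['f_rdigit'] < 0.2))

clean = feats.loc[keep, '__TEXT__'].tolist()
dirty = feats.loc[~keep, '__TEXT__'].tolist()
print(clean)
print(dirty)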
Also, you don't HAVE to start from pandas objects: a plain list of dictionaries works as the input, as shown above, since pd.DataFrame converts it. The function itself operates on the resulting DataFrame, which is how the original code was written.
Because this example involves a lot of steps, I've created a gist where I run through an example up to the point of producing a clean list of sentences and a dirty list of not-well-formed sentences:
my gist: https://gist.github.com/marcmaxson/4ccca7bacc72eb6bb6479caf4081cefb
It replaces the Stanford CoreNLP Java library with spaCy, a newer and easier-to-use Python library that fills in the missing metadata, such as sentiment, named entities, and parts of speech, used to determine whether a sentence is well-formed. It runs under Python 3.6 but should also work under 2.7, since the libraries involved are backwards compatible.
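If you don't want to work through the whole gist, here is a rough sketch of how you might fill the word and ner columns with spaCy instead of passing empty lists. The 'O' convention mirrors CoreNLP's "not an entity" tag; the exact column construction here is my assumption, not copied from the gist:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def to_row(text):
    doc = nlp(text)
    return {
        "__TEXT__": text,
        "word": [t.text for t in doc],
        # Mimic CoreNLP-style tags: 'O' for tokens outside any entity.
        "ner": [t.ent_type_ if t.ent_type_ else "O" for t in doc],
    }

raw = ["This is a well-formed sentence.",
       "but this ain't a good sent",
       "just a fragment"]
df = pd.DataFrame([to_row(s) for s in raw])
feats = make_basic_features(df)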