Here's a solution where correctness is defined as: `an` comes before a
word that starts with a vowel sound; otherwise `a` may be used:

    #!/usr/bin/env python
    import itertools
    import re
    import sys

    try:
        from future_builtins import map, zip
    except ImportError:  # Python 3 (or old Python versions)
        map, zip = map, zip
    from operator import methodcaller

    import nltk  # $ pip install nltk
    from nltk.corpus import cmudict  # >>> nltk.download('cmudict')

    def starts_with_vowel_sound(word, pronunciations=cmudict.dict()):
        # vowel phonemes in cmudict end with a stress digit, e.g. 'AW1'
        for syllables in pronunciations.get(word, []):
            return syllables[0][-1].isdigit()  # use only the first one

    def check_a_an_usage(words):
        # iterate over words pairwise (recipe from itertools)
        # note: ignore Unicode case-folding (`.casefold()`)
        a, b = itertools.tee(map(methodcaller('lower'), words))
        next(b, None)
        for a, w in zip(a, b):
            # raw string so that \w is not treated as a string escape
            if (a == 'a' or a == 'an') and re.match(r'\w+$', w):
                valid = (a == 'an') if starts_with_vowel_sound(w) else (a == 'a')
                yield valid, a, w

    # note: you could use nltk to split text into paragraphs, sentences, words
    pairs = ((a, w)
             for sentence in sys.stdin.readlines() if sentence.strip()
             for valid, a, w in check_a_an_usage(nltk.wordpunct_tokenize(sentence))
             if not valid)

    print("Invalid indefinite article usage:")
    print('\n'.join(map(" ".join, pairs)))

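The vowel-sound test relies on a CMUdict convention: each pronunciation is a list of ARPAbet phonemes, and vowel phonemes carry a trailing stress digit (`0`, `1`, or `2`). Here is a minimal sketch of that check that runs without nltk. `SAMPLE_PRONUNCIATIONS` is a hand-written stand-in for `cmudict.dict()`, so treat its entries as illustrative rather than as a verbatim copy of the dictionary.

```python
# Hand-written sample entries in CMUdict's ARPAbet style (an assumption for
# illustration; the real data comes from nltk.corpus.cmudict).
SAMPLE_PRONUNCIATIONS = {
    "hour": [["AW1", "ER0"]],
    "house": [["HH", "AW1", "S"]],
    "european": [["Y", "UH2", "R", "AH0", "P", "IY1", "AH0", "N"]],
}

def starts_with_vowel_sound(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    # Look only at the first listed pronunciation: the word starts with a
    # vowel sound if its first phoneme ends with a stress digit.
    for phonemes in pronunciations.get(word, []):
        return phonemes[0][-1].isdigit()
    return None  # word not found in the dictionary

for word in ("hour", "house", "european", "xyzzy"):
    print(word, starts_with_vowel_sound(word))
```

Note that a word missing from the dictionary yields `None`, which is falsy, so the checker treats unknown words as if they started with a consonant sound and would report `an` in front of them.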
Example input (one sentence per line):

    Validity is defined as `an` comes before a word that starts with a
    vowel sound, otherwise `a` may be used.
    Like "a house", but "an hour" or "a European" (from @Hyperboreus's comment http://stackoverflow.com/questions/20336524/gramatically-correct-an-english-text-python#comment30353583_20336524 ).
    A AcRe, an AcRe, a rhYthM, an rhYthM, a yEarlY, an yEarlY (words from @tchrist's comment http://stackoverflow.com/questions/9505714/python-how-to-prepend-the-string-ub-to-every-pronounced-vowel-in-a-string#comment12037821_9505868 )
    We have found a (obviously not optimal) solution." vs. "We have found an obvious solution (from @Hyperboreus answer)
    Wait, I will give you an... -- he shouted, but dropped dead before he could utter the last word. (ditto)

Output:

    Invalid indefinite article usage:
    a acre
    an rhythm
    an yearly

It is not obvious why the last pair is invalid; see the question "Why is it “an yearly”?".
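To trace how the three pairs above get flagged, here is a rough, nltk-free sketch of the pairwise check applied to the mixed-case sample sentence. Everything in it is a stand-in: `FIRST_PHONEME` is a hypothetical hand-written table (first phoneme per word, ARPAbet style), `invalid_pairs` is a name made up for this example, and the regular expression only approximates what `nltk.wordpunct_tokenize` does.

```python
import re

# Hypothetical lookup table: the first phoneme of each word, written by hand
# in ARPAbet style. Vowel phonemes end with a stress digit.
FIRST_PHONEME = {"acre": "EY1", "rhythm": "R", "yearly": "Y"}

def starts_with_vowel_sound(word):
    # Unknown words map to "", and "".isdigit() is False.
    return FIRST_PHONEME.get(word, "")[-1:].isdigit()

def invalid_pairs(text):
    # Split into runs of word characters and runs of punctuation,
    # approximating nltk.wordpunct_tokenize, then lowercase everything.
    words = [w.lower() for w in re.findall(r"\w+|[^\w\s]+", text)]
    # Walk over adjacent (previous, next) token pairs.
    for article, word in zip(words, words[1:]):
        if article in ("a", "an") and re.match(r"\w+$", word):
            expected = "an" if starts_with_vowel_sound(word) else "a"
            if article != expected:
                yield article, word

sample = "A AcRe, an AcRe, a rhYthM, an rhYthM, a yEarlY, an yEarlY"
print(list(invalid_pairs(sample)))
# [('a', 'acre'), ('an', 'rhythm'), ('an', 'yearly')]
```

With this table, "yearly" begins with the consonant phoneme `Y`, so the check expects `a` and reports `an yearly`.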