9

I want to create a program that reads text from a file and points out when "a" and "an" is used incorrect. The general rule as far as I know is that "an" is used when the next words starts with a vowel. But it should also take into consideration that there are exceptions which also should be read from a file.

Could someone give me some tips on how to get started with this? Pointers to functions or libraries that could help would be welcome.

I would be very glad :-)

I'm quite new to Python.

Hyperboreus
user3058751
  • 2
    Welcome to stack overflow. What have you tried and what problem did you have? – cmd Dec 02 '13 at 19:48
  • 7
    AFAIK, the criterion is whether the next word starts with a vowel or consonant __sound__, no matter which grapheme. Like "a house", but "an hour" or "a European". So basically you need to infer somehow the pronunciation of the following word. – Hyperboreus Dec 02 '13 at 19:49
  • 4
    Understanding natural languages is a very difficult problem and far from being solved. This is a big research topic and definitely not something you should choose as a programming beginner. – poke Dec 02 '13 at 19:51
  • 2
    well if all he wants to do is make words starting with a vowel prefixed by `an` ... thats not so hard ... if he wants to actually correct the grammar that is – Joran Beasley Dec 02 '13 at 19:52
  • 2
    You could start with "I want to create a program that reads text from a file and points out where "a" and "an" are used incorrectly." – Steve Barnes Dec 02 '13 at 19:54
  • 1
    What Hyperboreus said is actually _understating_ the problem. First, different dialects have different silent letters for the same words—like "historical" starts with a consonant in American and most British dialects, but with a vowel in a few. Second, many dialects froze the `a`/`an` rules long ago but changed pronunciation since then, so an upper-class Londoner might say "an historical event" even though he pronounces the `h`. – abarnert Dec 02 '13 at 20:07
  • 1
    related: [Python: How to prepend the string 'ub' to every pronounced vowel in a string?](http://stackoverflow.com/q/9505714/4279). (on how to detect vowel sounds) – jfs Dec 02 '13 at 20:07
  • Well, thank you for your input. I know my english isn't perfect, since it's my second language. And this software does not have to be perfect, but yes I am aware that it is the sound that matters. That was what I meant with exceptions. – user3058751 Dec 02 '13 at 20:18
  • 1
    A pity that this question has been put on hold, as the answer provided by @J.F.Sebastian is really helpful in this context, and give a nice and short introduction to nltk. – Hyperboreus Dec 02 '13 at 21:53
  • Try [inflect Python library](https://pypi.python.org/pypi/inflect) – Dennis Golomazov Mar 21 '16 at 16:21

4 Answers

13

Here's a solution where correctness is defined as follows: `an` comes before a word that starts with a vowel sound; otherwise `a` is expected:

#!/usr/bin/env python
import itertools
import re
import sys

try:
    from future_builtins import map, zip
except ImportError: # Python 3 (or old Python versions)
    map, zip = map, zip
from operator import methodcaller

import nltk  # $ pip install nltk
from nltk.corpus import cmudict  # >>> nltk.download('cmudict')

def starts_with_vowel_sound(word, pronunciations=cmudict.dict()):
    for syllables in pronunciations.get(word, []):
        return syllables[0][-1].isdigit()  # use only the first one

def check_a_an_usage(words):
    # iterate over words pairwise (recipe from itertools)
    #note: ignore Unicode case-folding (`.casefold()`)
    a, b = itertools.tee(map(methodcaller('lower'), words)) 
    next(b, None)
    for a, w in zip(a, b):
        if (a == 'a' or a == 'an') and re.match(r'\w+$', w):  # raw string: keep \w for the regex engine
            valid = (a == 'an') if starts_with_vowel_sound(w) else (a == 'a')
            yield valid, a, w

#note: you could use nltk to split text in paragraphs,sentences, words
pairs = ((a, w)
         for sentence in sys.stdin.readlines() if sentence.strip() 
         for valid, a, w in check_a_an_usage(nltk.wordpunct_tokenize(sentence))
         if not valid)

print("Invalid indefinite article usage:")
print('\n'.join(map(" ".join, pairs)))

Example input (one sentence per line)

Validity is defined as `an` comes before a word that starts with a
vowel sound, otherwise `a` may be used.
Like "a house", but "an hour" or "a European" (from @Hyperboreus's comment http://stackoverflow.com/questions/20336524/gramatically-correct-an-english-text-python#comment30353583_20336524 ).
A AcRe, an AcRe, a rhYthM, an rhYthM, a yEarlY, an yEarlY (words from @tchrist's comment http://stackoverflow.com/questions/9505714/python-how-to-prepend-the-string-ub-to-every-pronounced-vowel-in-a-string#comment12037821_9505868 )
We have found a (obviously not optimal) solution." vs. "We have found an obvious solution (from @Hyperboreus answer)
Wait, I will give you an... -- he shouted, but dropped dead before he could utter the last word. (ditto)

Output

Invalid indefinite article usage:
a acre
an rhythm
an yearly

It is not obvious why the last pair is invalid, see Why is it “an yearly”?
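
To see why the `syllables[0][-1].isdigit()` test works: CMUdict writes each pronunciation as a list of ARPAbet phonemes, and only vowel phonemes carry a trailing stress digit (0, 1, or 2). The sketch below reproduces the check on a small hand-written table, so it runs without downloading the corpus. `SAMPLE_PRONUNCIATIONS` and `expected_article` are illustrative names, not part of nltk, and the entries were transcribed by hand in CMUdict's style, so treat them as assumptions rather than authoritative data.

```python
# Hand-written entries in the style of CMUdict (assumption: illustrative
# only; the real dictionary comes from nltk.corpus.cmudict.dict()).
SAMPLE_PRONUNCIATIONS = {
    "house": [["HH", "AW1", "S"]],
    "hour": [["AW1", "ER0"]],
    "european": [["Y", "UH2", "R", "AH0", "P", "IY1", "AH0", "N"]],
    "acre": [["EY1", "K", "ER0"]],
    "rhythm": [["R", "IH1", "DH", "AH0", "M"]],
    "yearly": [["Y", "IH1", "R", "L", "IY0"]],
}


def starts_with_vowel_sound(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Return True/False from the first listed pronunciation, None if unknown."""
    for phonemes in pronunciations.get(word.lower(), []):
        # Vowel phonemes end in a stress digit; consonant phonemes never do.
        return phonemes[0][-1].isdigit()
    return None


def expected_article(word):
    """Hypothetical helper: 'an', 'a', or None when the word is not listed."""
    vowel = starts_with_vowel_sound(word)
    if vowel is None:
        return None
    return "an" if vowel else "a"


for word in ["house", "hour", "European", "acre", "rhythm", "yearly", "xylophone"]:
    print(word, "->", expected_article(word))
```

Unlike the code above, which treats a word missing from the dictionary as if it started with a consonant sound, this sketch returns `None` for unknown words so the caller can decide whether to skip them or fall back to a spelling rule.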

Community
  • 1
  • 1
jfs
  • 1
    +1. Great answer. The only open problem is how to actually identify the undefined articles, but I fear you would need to completely parse the sentence to find all "DP -> D* NP" productions (or "NP -> DP N*" depending on which syntactic school you are and how you look at determiners). – Hyperboreus Dec 02 '13 at 21:49
4

Maybe this can give you a rough guideline:

  1. You need to parse the input text into prosodic units, as I doubt that the rules for "a/an" apply over prosodic boundaries (e.g. "We have found a (obviously not optimal) solution." vs. "We have found an obvious solution").

  2. Next you need to parse each prosodic unit into phonological words.

  3. Now you somehow need to identify those words that represent the indefinite article ("a house" vs "grade A product").

  4. Once you have identified the articles, look at the next word in your prosodic unit and determine (here be dragons) the syllabic feature of the first phoneme of this word.

  5. If it has [+syll], the article should be "an". If it has [-syll], the article should be "a". If the article is at the end of the prosodic unit, it should probably be "a" (but what about ellipses: "Wait, I will give you an... -- he shouted, but dropped dead before he could utter the last word."?). All of this is subject to historical exceptions as mentioned by abarnert, dialectal variance, etc.

  6. If the found article doesn't match the expected, mark this as "incorrect".


Here is some pseudocode:

def parseProsodicUnits(text): raise NotImplementedError  # here be dragons
def parsePhonologicalWords(unit): raise NotImplementedError  # here be dragons
def isUndefinedArticle(word): raise NotImplementedError  # here be dragons
def parsePhonemes(word): raise NotImplementedError  # here be dragons
def getFeatures(phoneme): raise NotImplementedError  # here be dragons

# `text` is the input to check
for unit in parseProsodicUnits(text):
    words = parsePhonologicalWords(unit)
    for idx, word in enumerate(words[:-1]):
        if not isUndefinedArticle(word):
            continue
        # the first phoneme of the following word decides the article
        syllabic = '+syll' in getFeatures(parsePhonemes(words[idx + 1])[0])
        if (word == 'a' and syllabic) or (word == 'an' and not syllabic):
            print('incorrect')
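
Step 1 can be approximated very crudely by treating punctuation as a prosodic boundary, so that an article is never paired with a word on the other side of a parenthesis, dash, or ellipsis. The sketch below does only that. `split_units` and `article_pairs` are hypothetical names, the choice of boundary characters is an assumption, and real prosodic parsing is far harder than a regular expression.

```python
import re

# Assumption: any run of these punctuation marks ends a prosodic unit.
BOUNDARY = re.compile(r'[().,;:!?"]+|--+')


def split_units(text):
    """Split text into word lists, one per punctuation-delimited chunk."""
    units = []
    for chunk in BOUNDARY.split(text):
        words = chunk.split()
        if words:
            units.append(words)
    return units


def article_pairs(text):
    """Yield (article, next_word) pairs that lie inside the same unit."""
    for words in split_units(text):
        for article, following in zip(words, words[1:]):
            if article.lower() in ("a", "an"):
                yield article, following


# "a" is followed by a parenthesis, so it is not paired with "obviously".
print(list(article_pairs("We have found a (obviously not optimal) solution.")))
# prints []
print(list(article_pairs("We have found an obvious solution.")))
# prints [('an', 'obvious')]
```

An article left at the end of a unit, as in the ellipsis example from step 5, simply produces no pair here, which sidesteps the question of what it should be rather than answering it.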
Hyperboreus
  • I've implemented some of your suggestions based on `nltk.corpus.cmudict` pronunciations dictionary. It might be useful to add examples when `a` or `an` do not represent an indefinite article, the article is at the end of the prosodic unit, historical exceptions, dialectal variance that actually break [my naive code](http://stackoverflow.com/a/20337527/4279). – jfs Dec 02 '13 at 21:47
1
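all_words = "this is an wonderful life".split()
# stop one short of the end: the last word has no successor to inspect
for i in range(len(all_words) - 1):
    if all_words[i].lower() in ["a", "an"]:
        # spelling-based guess: "an" before a, e, i, o, u; otherwise "a"
        if all_words[i + 1][0].lower() in "aeiou":
            all_words[i] = all_words[i][0] + "n"
        else:
            all_words[i] = all_words[i][0]
print(" ".join(all_words))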

That should get you started; however, it is not a complete solution.

Joran Beasley
1

I'd probably start with an approach like:

exceptions = set()  # fill with a whole bunch of exceptions
text = "a apple and an pear"  # sample input; normally the contents of the file
article = None
for word in text.split():
    if article:
        # spelling-based guess, flipped for words listed as exceptions
        vowel = word[0].lower() in "aeiou"
        if word.lower() in exceptions:
            vowel = not vowel
        if (article.lower() == "an" and not vowel) or (article.lower() == "a" and vowel):
            print("Misused article '%s %s'" % (article, word))
        article = None
    if word.lower() in ('a', 'an'):
        article = word
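
The question also asks for the exceptions to be read from a file. Here is a minimal sketch of that, assuming one exception word per line in a plain-text UTF-8 file. The function names, the file layout, and the punctuation stripping are assumptions for illustration; nothing in the question specifies them.

```python
def load_exceptions(path):
    """Read one exception word per line; blank lines are ignored."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def find_misused_articles(text, exceptions):
    """Return (article, word) pairs that break the spelling-based rule."""
    misused = []
    article = None
    for raw in text.split():
        cleaned = raw.strip('.,;:!?"()')  # drop surrounding punctuation
        word = cleaned.lower()
        if article and word:
            vowel = word[0] in "aeiou"
            if word in exceptions:
                vowel = not vowel  # listed words flip the spelling-based guess
            # "an" is expected exactly when the next word counts as a vowel start
            if (article.lower() == "an") != vowel:
                misused.append((article, word))
        # remember the article so the next word can be checked against it
        article = cleaned if word in ("a", "an") else None
    return misused


# Usage (both file names are hypothetical):
#   exceptions = load_exceptions("exceptions.txt")
#   with open("input.txt", encoding="utf-8") as handle:
#       for article, word in find_misused_articles(handle.read(), exceptions):
#           print("Misused article '%s %s'" % (article, word))
```

Returning the pairs instead of printing them keeps the check separate from the reporting, which makes it easier to swap the spelling rule for a pronunciation-based one later.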
Steve Barnes
cmd