14

I have a database of nouns (e.g. "house", "exclamation point", "apple") that I need to output and describe in my application. It's hard to put together a natural-sounding sentence to describe an item without using "a" or "an": "a house is BIG", "an exclamation point is SMALL", etc.

Is there any function, library, or hack I can use in PHP to determine whether a given noun should be preceded by "a" or "an"?

durron597
MarathonStudios
  • If there was a php core function that did this I might actually begin to think that PHP is bloated. – webbiedave Dec 29 '10 at 22:39
  • Grammatically, words beginning with h (like house) could take either a or an, e.g. an honest man (where the h is silent), a hound dog (where the h is hard), a/an hotel (depending on how pretentious the speaker is) – Mark Baker Dec 29 '10 at 22:58
  • @webbiedave - if there was a php core function that did this, I'd begin to think that the PHP core developers had forgotten the existence of languages other than English... and for my next trick, a function to identify whether a French noun should be preceded with le, la, les or l' – Mark Baker Dec 29 '10 at 23:00
  • lol, Mark Baker. I think the function would automatically switch to that based on locale settings :) – webbiedave Dec 29 '10 at 23:03

8 Answers

10

I needed this for a C# project, so here's a C# port of the Python code linked from the Lingua::EN::Inflect answer. Make sure to include using System.Text.RegularExpressions; in your source file.

private string GetIndefiniteArticle(string noun_phrase)
{
    string word = null;
    var m = Regex.Match(noun_phrase, @"\w+");
    if (m.Success)
        word = m.Groups[0].Value;
    else
        return "an";

    var wordi = word.ToLower();
    foreach (string anword in new string[] { "euler", "heir", "honest", "hono" })
        if (wordi.StartsWith(anword))
            return "an";

    if (wordi.StartsWith("hour") && !wordi.StartsWith("houri"))
        return "an";

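    // Letters whose spoken names start with a vowel sound ("ef", "aitch", ...); 'f' belongs here, 'd' does not
    var char_list = new char[] { 'a', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'x' };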
    if (wordi.Length == 1)
    {
        if (wordi.IndexOfAny(char_list) == 0)
            return "an";
        else
            return "a";
    }

    // Anchored with ^ so that only the start of the word is tested, as in the Perl original
    if (Regex.Match(word, "^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]").Success)
        return "an";

    // Verbatim strings (@"...") so that \b is a regex word boundary, not a backspace character
    foreach (string regex in new string[] { @"^e[uw]", @"^onc?e\b", @"^uni([^nmd]|mo)", @"^u[bcfhjkqrst][aeiou]" })
    {
        if (Regex.IsMatch(wordi, regex))
            return "a";
    }

    if (Regex.IsMatch(word, "^U[NK][AIEO]"))
        return "a";
    else if (word == word.ToUpper())
    {
        if (wordi.IndexOfAny(char_list) == 0)
            return "an";
        else
            return "a";
    }

    if (wordi.IndexOfAny(new char[] { 'a', 'e', 'i', 'o', 'u' }) == 0)
        return "an";

    if (Regex.IsMatch(wordi, "^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)"))
        return "an";

    return "a";
}
Stuart
  • If you're looking for a C# implementation, I wrote http://code.google.com/p/a-vs-an/ which deals even with odd corner cases by using actual usage patterns extracted from wikipedia. It's AvsAn on nuget. – Eamon Nerbonne Mar 30 '13 at 16:30
9

I was also looking for such a solution, but in JavaScript, so I ported it to JS. You can check out the actual project on GitHub: https://github.com/rigoneri/indefinite-article.js

Here is the code snippet:

 function indefinite_article(phrase) {

    // Getting the first word 
    var match = /\w+/.exec(phrase);
    if (match)
        var word = match[0];
    else
        return "an";

    var l_word = word.toLowerCase();
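    // Specific word beginnings that should be preceded by 'an'
    var alt_cases = ["euler", "heir", "honest", "hono"];
    for (var i = 0; i < alt_cases.length; i++) {
        if (l_word.indexOf(alt_cases[i]) == 0)
            return "an";
    }
    // "hour", "hourly", ... but not "houri" (as in the Perl original)
    if (l_word.indexOf("hour") == 0 && l_word.indexOf("houri") != 0)
        return "an";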

    // Single-letter word which should be preceded by 'an'
    // ('f' is read "ef"; 'd' is read "dee" and takes 'a')
    if (l_word.length == 1) {
        if ("aefhilmnorsx".indexOf(l_word) >= 0)
            return "an";
        else
            return "a";
    }

    // Capital words which should likely be preceded by 'an'
    // (anchored with ^ so only the start of the word is tested)
    if (word.match(/^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]/)) {
        return "an";
    }

    // Special cases where a word that begins with a vowel should be preceded by 'a'
    var regexes = [/^e[uw]/, /^onc?e\b/, /^uni([^nmd]|mo)/, /^u[bcfhjkqrst][aeiou]/];
    for (var j = 0; j < regexes.length; j++) {
        if (l_word.match(regexes[j]))
            return "a";
    }

    // Special capital words (UK, UN)
    if (word.match(/^U[NK][AIEO]/)) {
        return "a";
    }
    else if (word == word.toUpperCase()) {
        if ("aefhilmnorsx".indexOf(l_word[0]) >= 0)
            return "an";
        else 
            return "a";
    }

    // Basic rule: words that begin with a vowel are preceded by 'an'
    if ("aeiou".indexOf(l_word[0]) >= 0)
        return "an";

    // Instances where y followed by specific letters is preceded by 'an'
    if (l_word.match(/^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)/))
        return "an";

    return "a";
}
Rodrigo Neri
6

What you want is to determine the appropriate indefinite article. Lingua::EN::Inflect is a Perl module that does a great job. I've extracted the relevant code and pasted it below. It's just a bunch of cases and some regular expressions, so it shouldn't be difficult to port to PHP. A friend ported it to Python here if anyone is interested.

# 2. INDEFINITE ARTICLES

# THIS PATTERN MATCHES STRINGS OF CAPITALS STARTING WITH A "VOWEL-SOUND"
# CONSONANT FOLLOWED BY ANOTHER CONSONANT, AND WHICH ARE NOT LIKELY
# TO BE REAL WORDS (OH, ALL RIGHT THEN, IT'S JUST MAGIC!)

my $A_abbrev = q{
(?! FJO | [HLMNS]Y.  | RY[EO] | SQU
  | ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)?) [AEIOU])
[FHLMNRSX][A-Z]
};

# THIS PATTERN CODES THE BEGINNINGS OF ALL ENGLISH WORDS BEGINNING WITH A
# 'y' FOLLOWED BY A CONSONANT. ANY OTHER Y-CONSONANT PREFIX THEREFORE
# IMPLIES AN ABBREVIATION.

my $A_y_cons = 'y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)';

# EXCEPTIONS TO EXCEPTIONS

my $A_explicit_an = enclose join '|',
(
    "euler",
    "hour(?!i)", "heir", "honest", "hono",
);

my $A_ordinal_an = enclose join '|',
(
    "[aefhilmnorsx]-?th",
);

my $A_ordinal_a = enclose join '|',
(
    "[bcdgjkpqtuvwyz]-?th",
);

sub A {
    my ($str, $count) = @_;
    my ($pre, $word, $post) = ( $str =~ m/\A(\s*)(?:an?\s+)?(.+?)(\s*)\Z/i );
    return $str unless $word;
    my $result = _indef_article($word,$count);
    return $pre.$result.$post;
}

sub AN { goto &A }

sub _indef_article {
    my ( $word, $count ) = @_;

    $count = $persistent_count
        if !defined($count) && defined($persistent_count);

    return "$count $word"
        if defined $count && $count!~/^($PL_count_one)$/io;

    # HANDLE USER-DEFINED VARIANTS

    my $value;
    return "$value $word"
        if defined($value = ud_match($word, @A_a_user_defined));

    # HANDLE ORDINAL FORMS

    $word =~ /^($A_ordinal_a)/i         and return "a $word";
    $word =~ /^($A_ordinal_an)/i        and return "an $word";

    # HANDLE SPECIAL CASES

    $word =~ /^($A_explicit_an)/i       and return "an $word";
    $word =~ /^[aefhilmnorsx]$/i        and return "an $word";
    $word =~ /^[bcdgjkpqtuvwyz]$/i      and return "a $word";


    # HANDLE ABBREVIATIONS

    $word =~ /^($A_abbrev)/ox           and return "an $word";
    $word =~ /^[aefhilmnorsx][.-]/i     and return "an $word";
    $word =~ /^[a-z][.-]/i              and return "a $word";

    # HANDLE CONSONANTS

    $word =~ /^[^aeiouy]/i              and return "a $word";

    # HANDLE SPECIAL VOWEL-FORMS

    $word =~ /^e[uw]/i                  and return "a $word";
    $word =~ /^onc?e\b/i                and return "a $word";
    $word =~ /^uni([^nmd]|mo)/i         and return "a $word";
    $word =~ /^ut[th]/i                 and return "an $word";
    $word =~ /^u[bcfhjkqrst][aeiou]/i   and return "a $word";

    # HANDLE SPECIAL CAPITALS

    $word =~ /^U[NK][AIEO]?/            and return "a $word";

    # HANDLE VOWELS

    $word =~ /^[aeiou]/i                and return "an $word";

    # HANDLE y... (BEFORE CERTAIN CONSONANTS IMPLIES (UNNATURALIZED) "i.." SOUND)

    $word =~ /^($A_y_cons)/io           and return "an $word";

    # OTHERWISE, GUESS "a"
    return "a $word";
}
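For readers who don't read Perl, here is a minimal Python sketch of the same rule order. It is not the Python port mentioned above; the names `RULES` and `indefinite_article` are made up for illustration. It returns only the article (the Perl code returns the article plus the word) and leaves out the count handling and user-defined variants, since those rely on helpers not shown in the excerpt.

```python
import re

I = re.IGNORECASE

# Abbreviation pattern copied from the Perl excerpt (verbose, case-sensitive).
A_ABBREV = r"""
(?! FJO | [HLMNS]Y. | RY[EO] | SQU
  | ( F[LR]? | [HL] | MN? | N | RH? | S[CHKLMNPTVW]? | X(YL)? ) [AEIOU] )
[FHLMNRSX][A-Z]
"""

# (pattern, article) pairs, tried in the same order as the Perl rules.
RULES = [
    (re.compile(r"[bcdgjkpqtuvwyz]-?th", I), "a"),       # ordinal forms
    (re.compile(r"[aefhilmnorsx]-?th", I), "an"),
    (re.compile(r"euler|hour(?!i)|heir|honest|hono", I), "an"),  # special cases
    (re.compile(r"[aefhilmnorsx]\Z", I), "an"),          # single letters
    (re.compile(r"[bcdgjkpqtuvwyz]\Z", I), "a"),
    (re.compile(A_ABBREV, re.VERBOSE), "an"),            # abbreviations
    (re.compile(r"[aefhilmnorsx][.-]", I), "an"),
    (re.compile(r"[a-z][.-]", I), "a"),
    (re.compile(r"[^aeiouy]", I), "a"),                  # consonants
    (re.compile(r"e[uw]", I), "a"),                      # special vowel forms
    (re.compile(r"onc?e\b", I), "a"),
    (re.compile(r"uni([^nmd]|mo)", I), "a"),
    (re.compile(r"ut[th]", I), "an"),
    (re.compile(r"u[bcfhjkqrst][aeiou]", I), "a"),
    (re.compile(r"U[NK][AIEO]?"), "a"),                  # special capitals
    (re.compile(r"[aeiou]", I), "an"),                   # plain vowels
    (re.compile(r"y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)", I), "an"),
]


def indefinite_article(word: str) -> str:
    """Return 'a' or 'an' for a word, using the first rule that matches its start."""
    for pattern, article in RULES:
        if pattern.match(word):  # match() only tests the start of the string
            return article
    return "a"  # otherwise, guess "a"


for w in ["house", "apple", "hour", "houri", "user", "FBI", "UK", "yttrium"]:
    print(indefinite_article(w), w)
```

Because the first matching rule wins, the order matters: the explicit "an" cases such as "hour" have to be tested before the generic consonant rule, which would otherwise return "a".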
moinudin
  • This looks perfect, I'm fairly new to PHP so porting will be a bit of an adventure! – MarathonStudios Dec 29 '10 at 22:50
  • This looks useful! If anybody does find the time to port it to PHP, please post it here! – Neil Dec 29 '10 at 23:34
  • I will port this over the next few days and post in my original question – MarathonStudios Dec 30 '10 at 07:38
  • The [`Lingua::EN::Inflect`](http://search.cpan.org/dist/Lingua-EN-Inflect/lib/Lingua/EN/Inflect.pm#PROVIDING_INDEFINITE_ARTICLES) Perl module was [ported to PHP by @Kaivosukeltaja](https://stackoverflow.com/a/12798746/390798) and is also available on [github](https://github.com/Kaivosukeltaja/php-indefinite-article) – Improv Apr 11 '20 at 02:25
0

I've written a PHP port of the popular JS a-vs-an code described in this Stack Overflow post: https://stackoverflow.com/a/1288473/1526020

Github page: https://github.com/UseAllFive/a-vs-an.

E.g.

$result = $aVsAn->query('0800 number');
print_r($result);

Returns

Array
(
    [aCount] => 8
    [anCount] => 25
    [prefix] => 08
    [article] => an
)
Zach
0

I was looking for just such a solution, so thanks marcog. Here's an attempt to port your friend's Python version (I don't know Python or Perl, so there are probably some mistakes):

function indefinite_article($word) {
    // Lowercase version of the word
    $word_lower = strtolower($word);

    // An 'an' word (specific start of words that should be preceded by 'an')
    $an_words = array('euler', 'heir', 'honest', 'hono');
    foreach($an_words as $an_word) {
        if(substr($word_lower,0,strlen($an_word)) == $an_word) return "an";
    }
    if(substr($word_lower,0,4) == "hour" and substr($word_lower,0,5) != "houri") return "an";

    // An 'an' letter (single-letter word which should be preceded by 'an')
    $an_letters = array('a','e','f','h','i','l','m','n','o','r','s','x');
    if(strlen($word) == 1) {
        if(in_array($word_lower,$an_letters)) return "an";
        else return "a";
    }

    // Capital words which should likely be preceded by 'an'
    // (anchored with ^ so only the start of the word is tested)
    if(preg_match('/^(?!FJO|[HLMNS]Y.|RY[EO]|SQU|(F[LR]?|[HL]|MN?|N|RH?|S[CHKLMNPTVW]?|X(YL)?)[AEIOU])[FHLMNRSX][A-Z]/', $word)) return "an";

    // Special cases where a word that begins with a vowel should be preceded by 'a'
    $regex_array = array('^e[uw]','^onc?e\b','^uni([^nmd]|mo)','^u[bcfhjkqrst][aeiou]');
    foreach($regex_array as $regex) {
        if(preg_match('/'.$regex.'/',$word_lower)) return "a";
    }

    // Special capital words
    if(preg_match('/^U[NK][AIEO]/',$word)) return "a";
    // Not sure what this does
    else if($word == strtoupper($word)) {
        // Reuse $an_letters from above: the list needs 'f', not 'd' ("an FBI agent", "a DVD")
        if(in_array($word_lower[0],$an_letters)) return "an";
        else return "a";
    }

    // Basic rule: words that begin with a vowel are preceded by 'an'
    $vowels = array('a','e','i','o','u');
    if(in_array($word_lower[0],$vowels)) return "an";

    // Instances where y followed by specific letters is preceded by 'an'
    if(preg_match('/^y(b[lor]|cl[ea]|fere|gg|p[ios]|rou|tt)/', $word_lower)) return "an";

    // Default to 'a'
    return "a";
}

There's one bit (the branch below the comment "// Not sure what this does") whose purpose I'm unsure of. If anyone can figure it out, I'd be happy to know.
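As far as I can tell, that branch handles words written entirely in capitals that the earlier abbreviation pattern did not catch. It treats them as initialisms read letter by letter, so the article depends on the spoken name of the first letter ("ef", "aitch" and "em" start with a vowel sound; "dee" and "you" do not). A minimal Python sketch of that idea, with a hypothetical function name and the letter set the Perl module uses for single letters:

```python
# Letters whose spoken names start with a vowel sound ("ef", "aitch", "em", ...).
# Assumption: this is the same set the Perl module uses for single letters.
AN_LETTER_NAMES = set("aefhilmnorsx")


def article_for_initialism(word: str) -> str:
    """Pick 'a' or 'an' for an all-caps word that is read letter by letter."""
    if not word or word != word.upper():
        raise ValueError("expected a non-empty all-caps word")
    return "an" if word[0].lower() in AN_LETTER_NAMES else "a"


for w in ["MRI", "CPU", "HTML", "UFO"]:
    print(article_for_initialism(w), w)
```

This is also why the letter list in that branch should contain 'f' rather than 'd': with 'd' in the list, "DVD" would get "an" and "FTP" would get "a".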

Charlie
0

The problem with rule-based systems is that they handle edge cases poorly and that they're complicated. If you can base your decisions on actual data, you'll do better. In this answer I describe how you might use Wikipedia to build a lookup dictionary, and link to a (very simple) JavaScript implementation using such a dictionary.

A prefix-dictionary will deal fairly well with acronyms and numbers, though with some effort you could probably do better.
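To illustrate the idea, here is a minimal Python sketch of a prefix-dictionary lookup. It is not the linked implementation: the table, its counts and the function name are invented for the example, and a real table would be derived from a large corpus.

```python
# Hypothetical prefix table: prefix -> (times seen after "a", times seen after "an").
# All numbers are made up for illustration.
PREFIX_COUNTS = {
    "": (100, 10),      # fallback when nothing more specific is known
    "a": (5, 100),
    "e": (5, 100),
    "i": (5, 100),
    "o": (5, 100),
    "u": (20, 80),
    "eu": (40, 3),      # "a euro"
    "uni": (90, 10),    # "a unicorn"
    "unin": (5, 40),    # "an uninformed guess"
    "hou": (50, 2),     # "a house"
    "hour": (1, 60),    # "an hour"
}


def article_from_prefixes(word: str) -> str:
    """Use the counts stored for the longest known prefix of the word."""
    lowered = word.lower()
    for end in range(len(lowered), -1, -1):
        counts = PREFIX_COUNTS.get(lowered[:end])
        if counts is not None:
            a_count, an_count = counts
            return "an" if an_count > a_count else "a"
    return "a"  # only reached if the "" fallback is removed from the table


for w in ["hour", "house", "unicorn", "uninformed", "euro", "zebra"]:
    print(article_from_prefixes(w), w)
```

Longer prefixes override shorter ones, so an exception costs one extra entry in the table rather than a new rule in the code.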

Eamon Nerbonne
-1

Make an array with vowels in it. Check if the first letter of the word you are checking is in the vowel array. Will work except when dealing with acronyms.
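A minimal Python sketch of this first-letter check (the function name is made up for illustration), along with two inputs where it gives the wrong answer:

```python
VOWELS = set("aeiou")


def naive_article(word: str) -> str:
    """First-letter rule only: 'an' before a, e, i, o, u, otherwise 'a'."""
    return "an" if word[:1].lower() in VOWELS else "a"


print(naive_article("apple"))  # an
print(naive_article("house"))  # a
print(naive_article("hour"))   # a   (wrong: silent h, should be "an hour")
print(naive_article("user"))   # an  (wrong: long u, should be "a user")
```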

profitphp
  • 2
    or when dealing with silent h's but you can hard code those exceptions if you must. – Tesserex Dec 29 '10 at 22:32
  • Yes, the rule for using "an" is whether or not the word starts with a vowel. I remember asking this back in gr3 after my teacher got mad at me for using them interchangeably. – mpen Dec 29 '10 at 22:33
  • I would also check if it's all capitals (if necessary) for acronyms. Some can sound strange when pronounced with an "an" (typically those starting with a vowel name, e.g. F/"ef", L/"el", etc.) – Brad Christie Dec 29 '10 at 22:34
  • Ya, long U and silent H are exceptions. Check this link, http://www.ecenglish.com/learnenglish/when-use-a, and hard code if you need to. – profitphp Dec 29 '10 at 22:36
  • @Brad Christie It gets real tricky when you start thinking about "an SQL query" when some people say "sequel" and some say "S-Q-L". Technically though, you're not supposed to pronounce the acronym as a word, so in writing a rule you'd use "an". – profitphp Dec 29 '10 at 22:38
  • 1
    http://james.cridland.net/code/indefinite_article.html essentially does the vowel solution, but as some comments here point out it doesn't handle all cases. The Perl(/Python) library I linked to does a much more thorough job. – moinudin Dec 29 '10 at 22:53
  • Quote: "Yes, the rule for using "an" is whether or not the word starts with a vowel." Wrong, actually. The rule is to use "an" if the word starts with a vowel sound! That is why there are so many exceptions such as silent "h" and long "u". – danielson317 Nov 07 '14 at 17:52
-1

It should be pretty easy to write from scratch, tbh. If a word starts with a vowel, it gets an 'an'; if it begins with a consonant, it gets an 'a'. Programmatically it's easy to do. If you have any edge cases (e.g. you might use the BBC English-style 'an historic occasion'), you can handle them individually.

Kind of like using an inflector, only with the 'a'/'an' grammar rule instead of plurals. Look into how CakePHP or Rails handle inflection for a more thorough discussion of the concept, including how to handle edge cases - you don't want to inflect 'deer' as 'deers' in the plural, for example, or 'goose' as 'gooses', so they need to be handled individually, just like your own edge cases like 'universe' or aspirated/non-aspirated 'H's.
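A minimal Python sketch of that approach, in the spirit of an inflector's list of irregular words: a hand-maintained exception table is consulted first, and the vowel rule is the fallback. The table entries and names are illustrative only, not a complete list.

```python
# Hand-maintained exceptions, checked before the default rule.
# Assumption: whole-word entries, as in an inflector's "irregular" list.
ARTICLE_EXCEPTIONS = {
    "hour": "an",
    "honest": "an",
    "heir": "an",
    "universe": "a",
    "user": "a",
    "one": "a",
}

VOWELS = set("aeiou")


def article_with_exceptions(word: str) -> str:
    """Look the word up in the exception table, else fall back to the vowel rule."""
    lowered = word.lower()
    if lowered in ARTICLE_EXCEPTIONS:
        return ARTICLE_EXCEPTIONS[lowered]
    return "an" if lowered[:1] in VOWELS else "a"


for w in ["universe", "hour", "apple", "house"]:
    print(article_with_exceptions(w), w)
```

The cost is the one the comments below point out: every exception has to be entered by hand, and words missing from the table silently get the default rule.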

hollsk
  • 4
    It's a bit more complicated than just whether or not the word starts with a vowel. It's based on how the word is pronounced, not spelled. For example, we say "an hour" or "a user," not "a hour" or "an user". – mfonda Dec 29 '10 at 22:38
  • Yes, hence the 'edge cases' caveat. Those edge cases need to be put in manually unless you reckon you can invent a way to determine whether a pronunciation has a long vowel sound or not simply by analysing the following letters. Good luck with that. – hollsk Dec 29 '10 at 22:39
  • You may be able to use the soundex functions to help with that. – profitphp Dec 29 '10 at 22:41
  • For a php database web application?! – hollsk Dec 29 '10 at 22:44
  • Not saying I would go that far, but one could if so inclined. – profitphp Dec 29 '10 at 22:46