
I am interested in validating or automatically correcting the use of the indefinite articles "a" and "an" in blocks of English text from a textarea.

The grammatical rule is that the choice of article depends on the sound that begins the next word, not on its spelling. That seems too broad to encode by hand; however, a previous answer (How can I correctly prefix a word with "a" and "an"?) suggests mining a huge database of English text to derive heuristics that infer the correct indefinite article for a given situation. Eamon Nerbonne comments that he has done this, so how can I apply that solution to this practical implementation?

The function I have so far implements the simplest part of the grammatical rule: it uses "an" when the following word starts with a vowel letter, and "a" otherwise. It also respects the existing capitalization of the article. In actual use, though, this isn't practical because exceptions to that rule are very common. For example, "a horse" is correct, while "a honour" and "a HTTP address" are not.

How can my function be expanded to properly handle actual pronunciation of words following the articles, including silent letters, acronyms, and "sometimes-y"? I don't require 100% accuracy - something better than 80% would be enough to improve the text I'm correcting.

Here's my fixArticles() function; see the snippet for a running example.

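function fixArticles( txt ) {
  // Match each standalone "a" or "an" followed by a single space and a word.
  var valTxt = txt.replace(/\b(a|an) (\w*)\b/gim, function( match, article, following ) {
    // Start from the original first letter so its capitalization is preserved.
    var newArticle = article.charAt(0);
    switch (following.charAt(0).toLowerCase()) {
      case 'a':
      case 'e':
      case 'i':
      case 'o':
      case 'u':
        newArticle += 'n'; // an
        break;
      default:
        // a
        break;
    }
    // Highlight the article only when it was actually changed.
    if (newArticle !== article) {
      newArticle = "<span class='changed'>" + newArticle + "</span>";
    }
    return newArticle+' '+following;
  });

  // Show the result, turning newlines into line breaks.
  document.getElementById('output-text').innerHTML = valTxt.replace(/\n/gm,'<br/>');
}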

input, label {
    display:block;
}
.changed {
  font-weight: bold;
}
<label for="input-text">Enter text</label>
<textarea id="input-text" cols="50" rows="5">An wise man once said: "A apple an day keeps the doctor away."
Give me an break.
I would like an apple.
My daughter wants a hippopotamus for Christmas.
It was an honest error.
Did a user click the button?
An MSDS (material safety data sheet) was used to record the data.
</textarea>
<input type="button" value="Fix a/an" onClick="fixArticles(document.getElementById('input-text').value)">
<hr>
<div id="output-text"></div>

The expected output for the sample input is:

A wise man once said: "An apple a day keeps the doctor away."
Give me a break.
I would like an apple.
My daughter wants a hippopotamus for Christmas.
It was an honest error.
Did a user click the button?
An MSDS (material safety data sheet) was used to record the data.
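
To make the shortfall concrete, the sketch below applies the same vowel-letter rule to the article/word pairs in the expected output and lists the pairs it would get wrong. It is only an illustration: `naiveArticle` and `findMismatches` are hypothetical helper names that are not part of the original function, and the logic avoids the DOM so it can run on its own.

```javascript
// Hypothetical helper: the vowel-letter rule from fixArticles() as a pure function.
function naiveArticle(word) {
  return /^[aeiou]/i.test(word) ? 'an' : 'a';
}

// Hypothetical helper: scan known-good text for "a"/"an" + word pairs and
// collect the pairs where the naive rule disagrees with the article used.
function findMismatches(correctText) {
  var mismatches = [];
  var re = /\b(an?) (\w+)\b/gi;
  var m;
  while ((m = re.exec(correctText)) !== null) {
    var expected = m[1].toLowerCase();
    var guessed = naiveArticle(m[2]);
    if (guessed !== expected) {
      mismatches.push({ word: m[2], expected: expected, guessed: guessed });
    }
  }
  return mismatches;
}

// The expected output from the question, used here as the known-good text.
var expectedOutput = [
  'A wise man once said: "An apple a day keeps the doctor away."',
  'Give me a break.',
  'I would like an apple.',
  'My daughter wants a hippopotamus for Christmas.',
  'It was an honest error.',
  'Did a user click the button?',
  'An MSDS (material safety data sheet) was used to record the data.'
].join('\n');

var mismatches = findMismatches(expectedOutput);
mismatches.forEach(function (x) {
  console.log(x.word + ': expected "' + x.expected + '", naive rule gives "' + x.guessed + '"');
});
```

On this sample the naive rule disagrees on three of the nine pairs ("honest", "user", and "MSDS"), which covers a silent letter, a vowel letter with a consonant sound, and an acronym.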

Mogsdad
  • http://stackoverflow.com/a/1288473/1017882 – JayMee Dec 23 '15 at 16:57
  • Tough to do, but this is really an algorithm question for a rule-based system rather than just a JS and regex one. Not a bad question. – SoluableNonagon Dec 23 '15 at 16:59
  • Not relevant to the coding of the question, but I'm pretty sure *"A MSDS"* isn't correct. Also, I wonder if `soundex` (code for pronunciation of a string) could be useful for this question. – Tim Lewis Dec 23 '15 at 16:59
  • @TimLewis - Good eye - that was a typo in my expected results - the point being that my current script is wrong. – Mogsdad Dec 23 '15 at 17:01
  • I figured as much, but there's a lot of argument out there for the usage of *a vs an before M*, just wanted to make sure. – Tim Lewis Dec 23 '15 at 17:03
  • This is also going to miss about half the English vocabulary starting with the letter `h` (more in England, less in the States, since the English drop their 'h's more). `an herb`, but `a hatchet`, you get the idea. – ShadowRanger Dec 23 '15 at 17:04
  • As mentioned in the answer linked to by JayMee, apart from having a list of exceptions it's practically impossible to come up with an algorithm that could accurately predict which article is the correct one to use. – JJJ Dec 23 '15 at 17:04
  • The easy fix: use `a(n)` – julian soro Dec 23 '15 at 20:23
  • @JulianSoro - Love it! Won't make the client happy, but definitely the engineering answer. – Mogsdad Dec 23 '15 at 20:24

1 Answer


Prompted by the flippant answer to How can I correctly prefix a word with "a" and "an"?, Eamon Nerbonne took its advice and produced an efficient algorithm that accurately identifies the correct indefinite article for whatever text follows. So thanks, @JayMEE, for the pointer; it did actually help.

Implementation of the algorithm is outside the scope of basic Q & A - you can read about it in Eamon's blog entry and GitHub repository. However, it's dead simple to use!

Here's how fixArticles() can be modified to use the simple, minified version of Eamon's code, AvsAn-simple.min.js. See the JSFiddle Demo.

function fixArticles(txt) {
  var valTxt = txt.replace(/\b(a|an) ([\s\(\"'“‘-]?\w*)\b/gim, function(match, article, following) {
    var input = following.replace(/^[\s\(\"'“‘-]+|\s+$/g, ""); //strip initial punctuation symbols
    var res = AvsAnSimple.query(input);
    var newArticle = res.replace(/^a/i, article.charAt(0));
    if (newArticle !== article) {
      newArticle = "<span class='changed'>" + newArticle + "</span>";
    }
    return newArticle + ' ' + following;
  });

  document.getElementById('output-text').innerHTML = valTxt.replace(/\n/gm, '<br/>');
}
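
A variant worth considering captures the spacer and any opening punctuation in a group of their own, so the word handed to the lookup needs no further stripping and the original spacing survives in the output. The sketch below is an illustration under stated assumptions, not the library's own API: `fixArticlesText` and its `query` parameter are hypothetical names, and `stubQuery` merely stands in for `AvsAnSimple.query` so the example runs without the library. Whether to skip over `(` is a judgment call, since it may also match inside formulas.

```javascript
// Hypothetical variant of fixArticles(): returns the marked-up string instead
// of writing to the DOM, and receives the article lookup as a parameter.
// Assumption: query(word) returns 'a' or 'an', as AvsAnSimple.query does above.
function fixArticlesText(txt, query) {
  return txt.replace(/\b(a|an)(\s[\s("'“‘-]*)(\w+)\b/gi,
    function (match, article, spacer, following) {
      // Reuse the original first letter so capitalization is preserved.
      var newArticle = article.charAt(0) + query(following).slice(1);
      if (newArticle !== article) {
        newArticle = "<span class='changed'>" + newArticle + "</span>";
      }
      // Put back the exact spacer and opening punctuation that were matched.
      return newArticle + spacer + following;
    });
}

// Stand-in lookup for this example only; it is NOT the real library.
function stubQuery(word) {
  var known = { apple: 'an', honest: 'an', MSDS: 'an', user: 'a' };
  return Object.prototype.hasOwnProperty.call(known, word) ? known[word] : 'a';
}

var result = fixArticlesText('Give me an break. It was a ("honest") error.', stubQuery);
console.log(result);
```

In the browser, passing `AvsAnSimple.query` as the second argument and assigning the returned string to the output element's `innerHTML` should reproduce the behaviour of the function above.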
Mogsdad
  • Note that you can avoid the somewhat complex `following.replace` line by slightly tweaking your original regex: `/\b(a|an)(\s[\s\(\"'“‘-]*)(\w+)\b/` - that way the initial punctuation is a separate capture, so you don't need to further process `following`. And as a bonus, that lets you retain the spacer+initial punctuation in your replace if you `return newArticle + initialPunctuation + following;` – Eamon Nerbonne Dec 24 '15 at 11:08
  • Oh and how about `var res = article[0] + res.substr(1);`? – Eamon Nerbonne Dec 24 '15 at 11:10
  • I like the touch about marking up changed articles - nice detail! – Eamon Nerbonne Dec 24 '15 at 11:11
  • Oh, and the library already deals with punctuation, so you really shouldn't have to. It does not ignore `-` and `(` - you think it should (do you have an example in mind?) – Eamon Nerbonne Dec 24 '15 at 11:17
  • @EamonNerbonne Good points - I'll revisit my implementation. Wrt punctuation, I took the lead from your example code on GitHub, and thought you must have a reason to allow for each of those characters. I haven't seen a valid example for `-`, but plenty for `(`. I'll share my regex101 case when I get to a PC. – Mogsdad Dec 24 '15 at 11:41
  • Ah yes - well, there's a bad reason for that example code, and that's that it predates the lib's own stripping of symbols. I'll fix the example. Incidentally, the trickiness with `(` is that the library has no magic way of telling symbols from words - and Wikipedia (and possibly arbitrary text) also contains numerous formulae. Now, it's possible to strip `(`, but the result is that it may well match in a formula too. – Eamon Nerbonne Dec 24 '15 at 12:27
  • I experimented with which symbols to strip during mining, and compared the output of the classifier based on those choices, and for *Wikipedia* at least, allowing a `(` between article and word leads to too many false positives. I'm kind of struggling with how to present such nuanced choices in the library, however, which is why I just picked a conservative default. – Eamon Nerbonne Dec 24 '15 at 12:31
  • @EamonNerbonne My regex test is [here](https://regex101.com/r/nE1yA4/4). The text I'm interested in is SO posts, with markdown. I identify and protect code blocks before checking prose, so in the case of formulas, enclosing them in back-ticks serves both to format them as code and to exclude them from being changed by my code. The script I'm working on is [here](https://github.com/Tiny-Giant/Stack-Exchange-Editor-Toolkit/edit/dev-mogsdad/editor.user.js), see line 1709. I'd be happy to continue this chat there. – Mogsdad Dec 24 '15 at 22:08