I've made a function strConv() to run a number of replace() methods over text input from the user. It has a regex for each modification:

List of Replacements

  • Smart Quotes
     - Replace the straight quotes `'` and `"` with `‘`,`’` and `“`,`”`
  • Em Dashes
     - Replace `--` with ` — `
  • Ellipsis
     - Replace `...` with `…`
  • Ordinals
      - Replace the suffix of all ordinals with a superscript equivalent.
      - ex. `1st` to `1<sup>st</sup>` or `20th` to `20<sup>th</sup>`
  • Single Digits
      - Any occurrence of a single-digit number will be converted to its word equivalent.
      - ex. `1` to `one` or `7` to `seven`
  • Court Decision Titles
      - If there are any patterns like this:
      - `<i>`ONE OR MORE WORDS`</i>` v. `<i>`ONE OR MORE WORDS`</i>`
      - Remove the 2nd and 3rd tag:
      - `<i>`ONE OR MORE WORDS v. ONE OR MORE WORDS`</i>`
  • United States Abbreviation
      - Replace `U.S.` with `US`
  • Percentages
      - Replace `%` with ` percent`

I'm getting the wrong results in two places: ordinals and court decision titles. I'm including all of the regexes because my problem may stem from the order in which they run and how one of them affects another's results.

The MCVE below contains the actual regexes, a sample of test input, and a list of expected results to compare against the output. Just click the PROCESS button. Thank you for your valuable time.

MCVE

<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width,initial-scale=1, user-scalable=no">
  <title>strConv</title>
  <style>
    section {
      width: 90vw;
      min-height: 250px;
      border: 3px ridge grey;
      padding: 10px;
      margin: 30px auto;
    }
    button {
      display: block;
      margin: 0 auto;
      font-size: 24px;
    }
    dt {
      color: blue;
    }
  </style>
</head>

<body>
  <header>

  </header>
  <section id='editor1' contenteditable="true">
    "double quotes"
    <br>'single quotes'
    <br>em--dash
    <br>ellipsis...
    <br>19th
    <br>1st
    <br>fourth
    <br>1
    <br>2 9 23
    <br>
    <i>Roe</i> v. <i>Wade</i>
    <br>U.S..
    <br>%
    <br>
  </section>
  <button id='button1'>PROCESS</button>
  <section id='display1'></section>

  <footer>
    <h3>The content in the brackets [] is what is expected</h3>
    <dl>
      <dt>Smart Quotes: PASS</dt>
      <dd>"double quotes" [“double quotes”]</dd>
      <dd>'single quotes' [‘single quotes’]</dd>
      <dt>Em Dash: PASS</dt>
      <dd>em--dash [em — dash]</dd>
      <dt>Ellipsis: PASS</dt>
      <dd>ellipsis... [ellipsis…]</dd>
      <dt><mark>Ordinals: FAIL</mark></dt>
      <dd>19th [19<sup>th</sup>]</dd>
      <dd>
        <mark>1st [1<sup>st</sup>]</mark>
      </dd>
      <dd>fourth [fourth]</dd>
      <dt>Single Digits: PASS?</dt>
      <dd>1 [one]</dd>
      <dd>2 9 23 [two nine 23]</dd>
      <dt><mark>Court Decision Titles: FAIL</mark></dt>
      <dd>
        <mark><i>Roe</i> v. <i>Wade</i> [<i>Roe v. Wade</i>]</mark>
      </dd>
      <dt>United States Abbreviation: PASS</dt>
      <dd>U.S.. [US.]</dd>
      <dt>Percentages: PASS</dt>
      <dd>% [ percent]</dd>

    </dl>
  </footer>
  <script>
    document.getElementById('button1').addEventListener('click', stringUI, false);

    function stringUI() {
      var editor = document.getElementById('editor1');
      var content = editor.innerText;
      var result = strConv(content);
      var article = document.createElement('article');
      article.innerText = result;
      document.getElementById('display1').appendChild(article);
    }

    function strConv(str) {
      // Smart Quotes 
      str = str.replace(/(^|[-\u2014/(\[{"\s])'/g, "$1\u2018");
      str = str.replace(/'/g, "\u2019");
      str = str.replace(/(^|[-\u2014/(\[{\u2018\s])"/g, "$1\u201c");
      str = str.replace(/"/g, "\u201d");
      // Em Dashes                       
      str = str.replace(/--/g, " \u2014 ");
      // Ellipsis
      str = str.replace(/\.\.\./g, "\u2026");
/*FAIL*/// Ordinals
      str = str.replace(/\b([10-9]{1,3})(th|nd|rd|st)\b/g, "$1<sup>$2<\/sup>");
      // Single Digits
      str = str.replace(/\b1\b/g, "one");
      str = str.replace(/\b2\b/g, "two");
      str = str.replace(/\b3\b/g, "three");
      str = str.replace(/\b4\b/g, "four");
      str = str.replace(/\b5\b/g, "five");
      str = str.replace(/\b6\b/g, "six");
      str = str.replace(/\b7\b/g, "seven");
      str = str.replace(/\b8\b/g, "eight");
      str = str.replace(/\b9\b/g, "nine");
/*FAIL*/// Court Decision Titles
      str = str.replace(/(<i>\w.*)<\/i>(\s\bv\.\b\s)<i>(\w.*<\/i>)/g, "$1$2$3");
      // United States Abbreviation
      str = str.replace(/\bU\.S\.\b|\bU\.S\.(\.)/g, "US$1");
      // Percentages
      str = str.replace(/%/g, " percent");
      return str;
    }
  </script>
</body>

</html>
zer00ne
  • Why `1st` is replaced with `onest` should be quite obvious. – Bergi Jan 12 '17 at 03:07
  • `content = editor.innerText;` will not include the `<i>` tags. Simply add some `console.log(str)` lines (or `debugger;` statements) to your code and you'll see where it goes wrong. – Bergi Jan 12 '17 at 03:15
  • @Bergi `onest` is obvious now. It should've occurred to me expecting tags when I'm asking for text, thanks for your time. – zer00ne Jan 12 '17 at 03:58

1 Answer

This:

str = str.replace(/\b1\b/g, "one");

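will replace '1' with 'one' wherever the digit '1' has a word boundary on both sides. A word boundary (\b) is not a character but a zero-width position between a word character and a non-word character (or the start or end of the string). Since the ordinal matching occurs first, this: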

1st

becomes this:

1<sup>st</sup>

The number '1' in that text is now followed by the non-word character '<', so it has a word boundary on both sides, is matched, and is converted to

one<sup>st</sup>

You can fix this by changing the order so that the ordinal conversion runs after the single-digit conversion.
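As a rough sketch of the difference the order makes, the snippet below runs the two steps both ways. The function names and the lookup table are mine for illustration only; the patterns are adapted from the question (with `[0-9]` written out in the ordinal regex and the nine single-digit replacements folded into one).

```javascript
// Illustrative names only; patterns adapted from the question's strConv().
var WORDS = { 1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
              6: "six", 7: "seven", 8: "eight", 9: "nine" };

function singleDigits(str) {
  // A lone digit needs a word boundary on both sides. In "1st" the "1" is
  // followed by the word character "s", so it is skipped.
  return str.replace(/\b[1-9]\b/g, function (d) { return WORDS[d]; });
}

function ordinals(str) {
  return str.replace(/\b([0-9]{1,3})(th|nd|rd|st)\b/g, "$1<sup>$2</sup>");
}

var input = "1st 19th 1 2 9 23";

// Original order: ordinals first, so "1<sup>" exposes a lone "1".
var buggy = singleDigits(ordinals(input));
// Swapped order: digits first, then ordinals.
var fixed = ordinals(singleDigits(input));

console.log(buggy); // one<sup>st</sup> 19<sup>th</sup> one two nine 23
console.log(fixed); // 1<sup>st</sup> 19<sup>th</sup> one two nine 23
```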

This:

str = str.replace(/(<i>\w.*)<\/i>(\s\bv\.\b\s)<i>(\w.*<\/i>)/g, "$1");

does not work for the following reasons:

  1. This bit (\s\bv\.\b\s) does not match because '.' is not a word character, so there is no word boundary between the '.' and the space that follows it. In any case, you do not need the \b, since you have already specified that there is whitespace before and after.

  2. Even with that fixed, the input text does not contain the <i> and </i> tags! This is because you have used innerText to obtain the editor content instead of innerHTML.

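  3. The replacement is wrong: it only includes the first capture group. It should use all three groups: $1$2$3 (the second group already contains the whitespace around 'v.', so no extra spaces are needed).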
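Putting these points together, a corrected version might look like the sketch below. It assumes the string was read with innerHTML, so the tags are actually present. The function name `courtTitles` and the lazy `.*?` quantifiers are my additions, not part of the original code; the lazy form is there so that two titles on one line are not swallowed by a single greedy match.

```javascript
// Pattern from the question, kept only to show that it never matches.
var original = /(<i>\w.*)<\/i>(\s\bv\.\b\s)<i>(\w.*<\/i>)/g;

// Hypothetical helper: the \b around "v." is removed and all three groups
// are kept. Group 2 already carries the whitespace on both sides of "v.".
function courtTitles(html) {
  return html.replace(/(<i>\w.*?)<\/i>(\sv\.\s)<i>(\w.*?<\/i>)/g, "$1$2$3");
}

var sample = "<i>Roe</i> v. <i>Wade</i>";

console.log(sample.replace(original, "$1$2$3")); // unchanged, no match
console.log(courtTitles(sample));                // <i>Roe v. Wade</i>
```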

Finally I would like to warn you against trying to use regexes to parse HTML. If you are parsing a well known subset of HTML it can work OK, but it will easily break. See the famous HTML regex parsing answer!
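To give an idea of how easily such a pattern breaks, here is a small sketch using a de-\b'd version of the court title regex (the helper name `mergeCaseTitle` and the three sample strings are made up for illustration). Each variant is reasonable HTML for an italicised case name, yet none of them is matched.

```javascript
// Hypothetical helper using a literal "<i>" pattern like the one above.
function mergeCaseTitle(html) {
  return html.replace(/(<i>\w.*?)<\/i>(\sv\.\s)<i>(\w.*?<\/i>)/g, "$1$2$3");
}

var variants = [
  '<i class="case">Roe</i> v. <i>Wade</i>', // attribute on the first tag
  "<I>Roe</I> v. <I>Wade</I>",              // uppercase tag names
  "<em>Roe</em> v. <em>Wade</em>"           // a different element
];

variants.forEach(function (v) {
  // Logs true for every variant: the string comes back unchanged.
  console.log(mergeCaseTitle(v) === v);
});
```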

harmic
  • 1. I replaced `innerText` with `innerHTML` 2. placed the ordinals after the single digits and 3. removed the `\b` and added `$2 $3`. It works perfect, thank you very much, the time to process and post this answer boggles my mind, you are awesome, thank you. – zer00ne Jan 12 '17 at 03:55