Long story short, I have a string in JavaScript that contains unknown HTML code. I want to test whether or not that string contains ONLY items from the following list:

  • <p> tags
  • </p> tags
  • whitespace/newlines
  • &nbsp; characters

If the string contains anything other than the above, I want false; if it contains only the above (or nothing at all), I want true.

The complicating factor is that I want this to work regardless of how many times those 4 elements show up or in what order. The only logical way I can think of to do it is to put them as non-capturing groups inside a character class, but I don't think that works. Is there another way to match an arbitrary combination of those 4 elements?

EDIT: For those of you saying this shouldn't be done because I'm parsing HTML with regex, I can state it in a form that doesn't mention HTML:

I have a string containing an unknown sequence of words and whitespace characters. I want to test if it does not contain any words that are not "foo", "bar", or some combination thereof ("foobar", "barfoofoobar", etc.).

  • " foobar barfoo bar foo " - pass
  • " foobar barfoo bar food" - fail
  • " foobar barfo bar foo " - fail
  • regular expressions cannot handle irregular text. use a DOM parser. – Marc B Jul 16 '15 at 15:39
  • dave, that looks unrelated. That's a question about matching a single HTML tag, mine is about testing whether a string contains anything that isn't in the group of things I am looking for. Furthermore the answers to that question are unrelated to the question. – DroidFreak36 Jul 16 '15 at 15:48
  • The point is you shouldn't try to use RegEx to parse HTML. You're working in JavaScript--use the DOM instead. – Dave Jul 16 '15 at 15:54
  • I am working in a situation (TinyMCE plugin) where I don't have access to the HTML I am editing directly, only by grabbing it as a string. I am not aware of any way to treat the string as a DOM. Also, I don't think it's up to StackOverflow to decide what you should and should not do (unless that's the question), it should help you find out how to do it. I didn't ask whether or not to use regex on HTML, I asked how to use regex in a certain way. – DroidFreak36 Jul 16 '15 at 16:01

2 Answers


Using a "DOM parser", as suggested by Marc B, is not as difficult as you may think. If your environment is a browser, you can let it do the hard work of building the DOM for you and just inspect the result:

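function checkHTMLstring(code) {
  // Let the browser parse the string into a detached element.
  var fragment = document.createElement('div');
  fragment.innerHTML = code;
  // Live collection of every descendant element (text nodes are not included).
  var elems = fragment.getElementsByTagName('*');
  var i = -1,
    elem;
  // Indexing past the end yields undefined, which ends the loop.
  while ((elem = elems[++i])) {
    if (elem.tagName.toLowerCase() !== 'p') {
      return false;
    }
  }
  return true;
}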
<button onclick="alert(checkHTMLstring(prompt('enter code','foo<p>bar</p>baz')))">test</button>
myf
  • I am not sure if that works in my situation, since I'm in a TinyMCE plugin. Theoretically it would run in a browser, but I'm not sure exactly how TinyMCE integrates with the browser and its plugins. Also, it looks like that code ignores text content. That is, `<p>text here</p>` would pass since there are no non-`<p>` tags, but I want it to fail in that case. – DroidFreak36 Jul 16 '15 at 16:22
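
The text-content gap raised in that comment could be closed by walking text nodes as well as elements. The following is only a sketch, assuming a browser-like DOM is available; the names `hasOnlyEmptyParagraphs` and `checkHTMLStringStrict` are made up for illustration, and rejecting comments and other node types is an assumption about the desired behaviour.

```javascript
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// True when every node under `root` is a <p> element or a text node
// consisting solely of whitespace (including non-breaking spaces).
function hasOnlyEmptyParagraphs(root) {
  var children = root.childNodes;
  for (var i = 0; i < children.length; i++) {
    var node = children[i];
    if (node.nodeType === ELEMENT_NODE) {
      if (node.tagName.toLowerCase() !== 'p' || !hasOnlyEmptyParagraphs(node)) {
        return false;
      }
    } else if (node.nodeType === TEXT_NODE) {
      // The parser decodes &nbsp; to U+00A0, which \s matches in JavaScript,
      // so \S only fires on real content.
      if (/\S/.test(node.nodeValue)) {
        return false;
      }
    } else {
      return false; // assumption: comments and other node types fail the check
    }
  }
  return true;
}

// Browser-only entry point: parse the string, then walk the result.
function checkHTMLStringStrict(code) {
  var container = document.createElement('div');
  container.innerHTML = code;
  return hasOnlyEmptyParagraphs(container);
}
```

Keeping the tree walk separate from the parsing step means the walk can also be pointed at any node the editor already exposes, rather than only at a string.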

I think I may have come up with an answer to my own question right after posting it, using the | operator. If I am correct, /^(?:<p[^>]*>|<\/p[^>]*>|\s|&nbsp;)*$/i should match what I want.

<p[^>]*>|<\/p[^>]*>|\s|&nbsp; matches any one item on that list, and putting it in a non-capturing group allows me to use * on it.
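
The same alternation-inside-a-group idea carries over to the HTML-free version of the question. A minimal sketch, where `containsOnlyFooBar` is a hypothetical name chosen for illustration:

```javascript
// True when `text` is made up only of whitespace and runs of "foo" / "bar".
// The empty string also passes, because * allows zero repetitions.
function containsOnlyFooBar(text) {
  return /^(?:\s|foo|bar)*$/.test(text);
}

console.log(containsOnlyFooBar(' foobar barfoo bar foo ')); // true
console.log(containsOnlyFooBar(' foobar barfoo bar food')); // false
console.log(containsOnlyFooBar(' foobar barfo bar foo '));  // false
```

The anchors ^ and $ are what turn "contains some of these" into "contains nothing but these"; without them the test would succeed on any string.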

  • Actually, this would match any tag starting with `<p`, but I think that's fine in this case. The main point was figuring out how to match any permutation of the 4 options. – DroidFreak36 Jul 16 '15 at 16:41
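
If accepting other tags that start with `<p` (such as `<pre>`) ever becomes a problem, the opening-tag branch can require either an immediate `>` or whitespace after the `p`. This is a sketch rather than a tested drop-in: `isBlankParagraphMarkup` is a hypothetical name, and the pattern assumes attribute values never contain a literal `>`.

```javascript
// Opening <p> (optionally with attributes), closing </p>, whitespace, or &nbsp;
// repeated any number of times, and nothing else.
var BLANK_PARAGRAPHS = /^(?:<p(?:\s[^>]*)?>|<\/p\s*>|\s|&nbsp;)*$/i;

function isBlankParagraphMarkup(html) {
  return BLANK_PARAGRAPHS.test(html);
}

console.log(isBlankParagraphMarkup('<p>&nbsp;</p>\n<p> </p>')); // true
console.log(isBlankParagraphMarkup('<p class="x"></p>'));       // true
console.log(isBlankParagraphMarkup(''));                        // true
console.log(isBlankParagraphMarkup('<p>text here</p>'));        // false
console.log(isBlankParagraphMarkup('<pre></pre>'));             // false
```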