6

I'm trying to write a regex that will find a string of HTML tags inside a code editor (Khan Live Editor) and give the following error:

"You can't put <h1.. 2.. 3..> inside <p> elements."

This is the string I'm trying to match:

<p> ... <h1>

This is the string I don't want to match:

<p> ... </p><h1>

Instead the expected behavior is that another error message appears in this situation.

So in English I want a string that:
- starts with <p> and
- ends with <h1> but
- does not contain </p>.

It's easy enough to make this work if I don't care about the existence of a </p>. My expression looks like this, /<p>.*<h[1-6]>/ and it works fine. But I need to make sure that </p> does not come between the <p> and <h1> tags (or any <h#> tag, hence the <h[1-6]>).
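
To make the over-matching concrete, here is a minimal check of the expression above against both sample strings. The variable names are only illustrative and are not part of the editor's code.

```javascript
// The naive pattern from the question: <p>, anything, then a heading tag.
const naive = /<p>.*<h[1-6]>/;

const shouldMatch = "<p> ... <h1>";        // heading inside an open paragraph
const shouldNotMatch = "<p> ... </p><h1>"; // paragraph closed before the heading

console.log(naive.test(shouldMatch));    // true
console.log(naive.test(shouldNotMatch)); // true, because .* also consumes </p>
```

Both calls return true, which is exactly the problem: the pattern has no way to refuse a `</p>` in the middle.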


I've tried a lot of different expressions from some other posts on here:

Regular expression to match a line that doesn't contain a word?

From which I tried: <p>^((?!<\/p>).)*$</h1>

regex string does not contain substring

From which I tried: /^<p>(?!<\/p>)<h1>$/

Regular expression that doesn't contain certain string

This link suggested: aa([^a] | a[^a])aa

Which doesn't work in my case because I need the specific string "</p>" not just the characters of it since there might be other tags between <p> ... <h1>.


I'm really stumped here. The regex I've tried seems like it should work... Any idea how I would make this work? Maybe I'm implementing the suggestions from other posts wrong?

Thanks in advance for any help.

Edit:

To answer why I need this done:

The problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.

Community
  • 1
  • 1
Dan Fletcher
  • 1,099
  • 11
  • 29
  • Exactly. So the problem is that `<p><h1></h1></p>` is a syntax error since `h1` closes the first `<p>` and there is an unmatched `</p>`. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.
    – Dan Fletcher Nov 24 '15 at 18:53
  • This has nothing to do with your regex question, but it is actually correct and fine to have HTML content that contains an `<h1>`, `<h2>`, etc. before an explicit `</p>` as, in HTML5 (which has this flow-content rule), the `</p>` is completely optional. For instance, `<p>Paragraph 1.<p>Paragraph 2.<h1>Heading</h1><p>Paragraph 3.` is completely valid HTML5 and can be authored as such intentionally.

    – rgthree Nov 24 '15 at 18:57
  • Should we assume you don't ever have attributes or whitespace in the tags? – Alan McBee Nov 24 '15 at 19:03
  • @AlanMcBee Yes that's true. – Dan Fletcher Nov 24 '15 at 19:05
  • @rgthree The problem is that users are trying this: `<p><h1></h1></p>` Which is not valid because the `</p>` has no matching `<p>`.

    – Dan Fletcher Nov 24 '15 at 19:06
  • @Teemu It's not worded well, but that's not _actually_ what it says. It says: _"The end tag may be omitted if the `<p>` element is immediately followed by..."_. The key there is the "`<p>` element", which means _the entire paragraph_ and not the opening p tag. Essentially, the authored content `<p>Paragraph 2.<h1>Heading</h1>` is valid, as the paragraph intentionally ends after `Paragraph 2.` when the `<h1>` element ends the `<p>` element and starts a new flow content block.

    – rgthree Nov 24 '15 at 19:07
  • @DanFletcher I understand why it's desired to catch this and am not arguing that your implementation needs to change. Just commenting that `<p><h1></h1>` is _not actually_ an HTML5 syntax error, as it will be rendered in the browser correctly (and intentionally) as `<p></p><h1></h1>`. If you want to be explicit, you will need to catch _all of the elements_ that end a paragraph without the optional `</p>`. You can find that in the spec here: http://www.w3.org/TR/html5/grouping-content.html#the-p-element
    – rgthree Nov 24 '15 at 19:12
  • @rgthree Hmm... Looks like you're right. It's really _element_ , not the opening tag. – Teemu Nov 24 '15 at 19:12
  • HTML correctness aside, the solution is still a little ugly. For example, what about these tests: `test1

    para

    head

    para2end` which I think you want to match (as an error), and `test2

    para

    head

    para2

    end` which you would not want to match? Are these possibilities?
    – Alan McBee Nov 24 '15 at 19:17
  • @AlanMcBee Yep that's true. Maybe I should be more specific for what I'm checking for? My first thought was to do that but I was concerned about performance inside the live editor. – Dan Fletcher Nov 24 '15 at 19:20
  • @rgthree ahh I see now - you're definitely right. I think our syntax parser reports a syntax error because it won't validate. You'll get " No p element in scope but a p end tag seen." at https://validator.w3.org/. The parser is built from SlowParse btw. – Dan Fletcher Nov 24 '15 at 19:24
  • 1
    @DanFletcher You said that RegEx is your only option. However, you can cheat your validator and pass a RegEx from an IIFE in argument list, and utilize Niet the Dark Absol's code. Please [check a fiddle](http://jsfiddle.net/8hgLgvrb/). – Teemu Nov 24 '15 at 19:45
  • @Teemu Oh nice! I didn't think of doing that - I guess that would work too, eh? Thank you! – Dan Fletcher Nov 24 '15 at 19:59

5 Answers

6

Sometimes it's better to break a problem down.

var str = "YOUR INPUT HERE";
// Drop everything before the first <p>.
str = str.substr(str.indexOf("<p>"));
// Keep only what precedes the last <h1>.
str = str.substr(0,str.lastIndexOf("<h1>"));
if( str.indexOf("</p>") > -1) {
    // there is a <p>...</p>...<h1>
}
else {
    // there isn't
}

This code doesn't handle the case of "what if there is no <p> to begin with" very well, but it does give a basic idea of how to break a problem down into simpler parts, without using regex.
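
As a rough sketch of how the same idea could be packaged for reuse, the function below handles a missing `<p>` and accepts any of `<h1>` to `<h6>`. The function name is hypothetical, it looks at the first heading after the first `<p>` rather than the last `<h1>`, and it only inspects the first `<p>` in the input. It assumes tags carry no attributes or whitespace, as confirmed in the comments on the question.

```javascript
// Hypothetical helper: returns true when a heading tag opens before the
// first <p> has been closed. Assumes tags carry no attributes or whitespace.
function headingInsideOpenParagraph(html) {
    var start = html.indexOf("<p>");
    if (start === -1) {
        return false; // no paragraph at all, nothing to report
    }
    var rest = html.slice(start + 3);       // text after the opening <p>
    var heading = rest.search(/<h[1-6]>/);  // first heading after that <p>
    if (heading === -1) {
        return false; // no heading after the paragraph opens
    }
    // Report an error only if no </p> sits between the <p> and the heading.
    return rest.slice(0, heading).indexOf("</p>") === -1;
}

console.log(headingInsideOpenParagraph("<p> ... <h1>"));     // true
console.log(headingInsideOpenParagraph("<p> ... </p><h1>")); // false
console.log(headingInsideOpenParagraph("<h1>Title</h1>"));   // false
```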

Niet the Dark Absol
  • 320,036
  • 81
  • 464
  • 592
  • 3
    If it can be done without regex (Without adding too much complexity), then it should be done. +1 – Blue Nov 24 '15 at 18:50
  • Thank you. In this situation - at least for now - regex is my only option. – Dan Fletcher Nov 24 '15 at 18:58
  • This is actually a viable solution for my problem. As @Teemu pointed out to me, I could pass my validator a IIFE and this would work. Thanks again! – Dan Fletcher Nov 24 '15 at 20:01
3

Search for <p> followed by any number of characters, none of which starts a </p> (the (?!<\/p>) lookahead is checked before each character is consumed), eventually followed by <h[1-6]>. [^] matches any character, which lets us capture newlines as well.

/<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi

const strings = [ '<p> ... <h1>', '<p> ... </p><h1>', '<P> Hello <h1>', '<p></p><h1>',
                  '<p><h1>' ];

const regex = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;

const test = input => ({ input, test: regex.test(input), matches: input.match(regex) });

for(let input of strings) console.log(JSON.stringify(test(input)));

// { "input": "<p> ... <h1>",     "test": true,  "matches": ["<p> ... <h1>"]   }
// { "input": "<p> ... </p><h1>", "test": false, "matches": null               }
// { "input": "<P> Hello <h1>",   "test": true,  "matches": ["<P> Hello <h1>"] }
// { "input": "<p></p><h1>",      "test": false, "matches": null               }
// { "input": "<p><h1>",          "test": true,  "matches": ["<p><h1>"]        }
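
One caveat when reusing a pattern like this: a regex with the g flag remembers its position in lastIndex between calls to test(), so calling test() repeatedly on the same object can skip matches. The snippet above gets away with it because input.match(regex) leaves lastIndex at 0 again. A minimal sketch of the difference, with illustrative variable names:

```javascript
// Same pattern twice: once with the g flag, once without.
const globalRe = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;
const plainRe = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/i;
const input = "<p> ... <h1>";

console.log(globalRe.test(input)); // true, lastIndex is now 12
console.log(globalRe.test(input)); // false, the search resumed at index 12
console.log(plainRe.test(input));  // true
console.log(plainRe.test(input));  // true, no state is carried over
```

If only a yes/no answer is needed, dropping the g flag avoids the issue entirely.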
2

Your first regular expression was close; you needed to remove the ^ and $ anchors and end the pattern with the opening <h[1-6]> tag rather than </h1>. If you need to match across newlines, use [\s\S] instead of the dot (.).

Here's the final regex: <p>(?:(?!<\/p>)[\s\S])*<h[1-6]>
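
A quick way to see the effect of [\s\S] is to run the pattern on multi-line input. This is only a sketch; the sample strings and variable names are made up for illustration.

```javascript
// Pattern from above; [\s\S] matches any character, newlines included.
const headingInParagraph = /<p>(?:(?!<\/p>)[\s\S])*<h[1-6]>/;

const openParagraph = "<p>\n  Some text\n  <h2>Title</h2>\n</p>";
const closedParagraph = "<p>\n  Some text\n</p>\n<h2>Title</h2>";

console.log(headingInParagraph.test(openParagraph));   // true
console.log(headingInParagraph.test(closedParagraph)); // false

// With . instead of [\s\S], the newline right after <p> stops the match.
console.log(/<p>(?:(?!<\/p>).)*<h[1-6]>/.test(openParagraph)); // false
```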

However, a heading tag (<h1> to <h6>) directly after an unclosed <p> is perfectly legal HTML. The heading is not nested inside the paragraph: the two are sibling elements, with the paragraph element ending where the heading element begins.

A p element’s end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is not an a element.

http://www.w3.org/TR/html-markup/p.html

Pluto
  • 2,900
  • 27
  • 38
  • Wow! Thank you so much! This works better than I need it to :) The reason we catch

    btw is because it shouldn't pass a validation and we are trying to teach good practices. Thanks again.
    – Dan Fletcher Nov 24 '15 at 19:48
1

I'm reaching the conclusion that using a regular expression to find the error is going to turn your one problem into two problems.

Consequently, I think a better approach is to do a very simplistic form of tree parsing. A "poor-man's HTML parser", if you will.

Use a simple regular expression to simply find all tags in the HTML, and put them into a list in the same order in which they were found. Ignore the text nodes between the tags.

Then, walk through the list in order, keeping a running tally on the tags. Increment the P counter when you get a <p> tag, and decrement it when you get a </p> tag. Increment the H counter when you get an <h1> (etc.) tag, and decrement it on the closing tag.

If the H counter is > 0 while the P counter is > 0, that's your error.
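
The counter approach could be sketched roughly as follows. The function name is hypothetical, the counters are clamped at zero so a stray closing tag cannot drive them negative (an addition not described above), and tags are assumed to have no attributes or whitespace. Note that this rule also flags a <p> opened inside a heading, since both counters are then above zero.

```javascript
// Hypothetical helper: walks the tag list and reports when a paragraph
// and a heading are open at the same time.
function hasHeadingParagraphOverlap(html) {
    // Collect only the tags we care about, in document order.
    const tags = html.match(/<\/?(?:p|h[1-6])>/gi) || [];
    let pCount = 0;
    let hCount = 0;
    for (const tag of tags) {
        const delta = tag[1] === "/" ? -1 : 1; // closing tags decrement
        if (/^<\/?p>$/i.test(tag)) {
            pCount = Math.max(0, pCount + delta); // clamp: assumption, see lead-in
        } else {
            hCount = Math.max(0, hCount + delta);
        }
        if (pCount > 0 && hCount > 0) {
            return true; // both open at once: that's the error
        }
    }
    return false;
}

console.log(hasHeadingParagraphOverlap("<p> ... <h1>"));           // true
console.log(hasHeadingParagraphOverlap("<p> ... </p><h1>"));       // false
console.log(hasHeadingParagraphOverlap("<p><h1></h1></p>"));       // true
console.log(hasHeadingParagraphOverlap("<p>a</p><h2>b</h2><p>c")); // false
```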

Alan McBee
  • 4,202
  • 3
  • 33
  • 38
-2

I know I'm not formatting it correctly, but I think the logic will work

(just replace the AND and NOT with the correct symbols):

/(<p>.*<h[1-6]>)AND !(<p>.*</p><h[1-6]>)/

Let me know how it goes :)

Chris Conaty
  • 93
  • 1
  • 1
  • 10
  • Thanks, but if it was that simple I would have done that. It's not understanding the logic of it that I'm struggling with, it's implementing that logic into a regex. – Dan Fletcher Nov 24 '15 at 19:00