
I have a document containing lots of paragraphs. Some of these are subheadings, which are identifiable because they do not end with a full stop, like this:

<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>

I want to make the titles go into an h3 tag but not the sentences, so I need to find and replace all paragraphs not ending in a full stop. I need to do this with JavaScript. I have tried the following, but each attempt fails. In each case the text is first read into a variable called body.

body = body.replace(/<p>(.*?)(?!\.)<\/p>/gi, "<h3>$1</h3>");

That just turns every paragraph into an h3, so everything comes out bold.

This would work, I think:

body = body.replace(/<p>(.*?)(?<!\.)<\/p>/gi, "<h3>$1</h3>");

but JavaScript does not recognise negative lookbehind.
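
For readers on a current engine: lookbehind assertions were added to the language in ES2018, so the second attempt can be made to work there. The following is only a sketch, assuming lookbehind support and paragraphs that contain no nested tags. The `[^<]*` swapped in for `.*?` is a change from the attempt above, made so that a match cannot run across paragraph boundaries when all the text is on one line. The names `lookbehindBody` and `lookbehindResult` are illustrative.

```javascript
// Assumes an engine with ES2018 lookbehind support (recent Node.js or browsers).
// [^<]* keeps each match inside one paragraph; this assumes no nested tags.
const lookbehindBody = "<p>This is a title</p><p>This is a sentence.</p>";

const lookbehindResult = lookbehindBody.replace(
  /<p>([^<]*)(?<!\.)<\/p>/gi, // (?<!\.) rejects a period right before </p>
  "<h3>$1</h3>"
);

console.log(lookbehindResult);
// <h3>This is a title</h3><p>This is a sentence.</p>
```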

Any ideas how I do this?

DrBloke
  • Instead of trying to use regexp on HTML, which is always a slippery slope, I would pull out all the `p` elements, check their content, and add a class to the ones that don't end in a full stop. –  Aug 24 '15 at 16:46
  • Do you mean this? `<p>([^\.]*?)<\/p>`
    –  Aug 24 '15 at 17:04
  • Thanks Washington, I think you're right, except I don't think you need to escape the dot in a character class. So: `<p>([^.]*?)<\/p>`
    – DrBloke Aug 25 '15 at 11:32
  • Note also, Washington, that this doesn't capture this title `<p>1.2 million people like regex</p>` but as I mention below, this doesn't really matter to me.
    – DrBloke Aug 25 '15 at 11:57

2 Answers

You could do the replacement paragraph by paragraph, which would be cleaner than running a regex over the whole HTML:

[].forEach.call(document.getElementsByTagName('p'), function(p){
     if (!/[.?!]\s*$/.test(p.innerHTML)) p.outerHTML="<h3>"+p.innerHTML+"</h3>";
});
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>You want to handle questions, right?</p>
<p>I'm sure you do!</p>
<p>This is a title containing 1.2 million</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>

This way there's no problem if your HTML evolves (will you really always have only P elements?).
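
If the text lives in a string variable rather than in a document, the same per-paragraph test can be applied through a replacement callback. This is a minimal sketch, not part of the original answer: `promoteTitles` and `sample` are hypothetical names, and the `[^<]*` pattern assumes paragraphs contain no nested tags. Because the check only looks at the end of each paragraph, a title with a period in the middle is still promoted.

```javascript
// Hypothetical helper: promotes <p> elements that do not end in sentence
// punctuation to <h3>, working on a plain string instead of the DOM.
function promoteTitles(body) {
  // [^<]* assumes paragraphs contain no nested tags.
  return body.replace(/<p>([^<]*)<\/p>/gi, function (match, inner) {
    // Same end-of-text test as the DOM version: keep sentences untouched.
    return /[.?!]\s*$/.test(inner) ? match : "<h3>" + inner + "</h3>";
  });
}

const sample =
  "<p>1.2 million people don't care</p><p>This is a sentence.</p><p>Really?</p>";

console.log(promoteTitles(sample));
// <h3>1.2 million people don't care</h3><p>This is a sentence.</p><p>Really?</p>
```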

Denys Séguret
  • Why do you need the `\s*` part of the expression? –  Aug 24 '15 at 17:05
  • @sln It's very frequent to have unwanted spaces in HTML (often they're new lines at the start or end). – Denys Séguret Aug 24 '15 at 17:17
  • Ok, but I just don't see what whitespace has to do with `[.?!]`. Who cares if there is optional space? –  Aug 24 '15 at 19:27
  • Thanks Denys. I think this will be the way to go. I need to adapt it to work on my variable as I'm not working on an html document, but that should be easy enough. I will mark this as answered once I've confirmed I can do that. Note: your code won't capture this title: `<p>1.2 million people don't care</p>` even though it doesn't end with a full stop. But such an exception will be rare, so I can live with it.
    – DrBloke Aug 25 '15 at 10:47
  • For those who are interested, this is an adaptation of Denys' code that will work on the text string variable I describe in the question. Note: this does not work on the exception I define in the comment above. I'd be interested if anyone can think of a way to capture this exception too, but I don't require it. `body = body.replace(/<p>([^.?!]*?)<\/p>/gi, "<h3>$1</h3>");`
    – DrBloke Aug 25 '15 at 11:32
  • @DrBloke I realize after reading your comment that I forgot a $. You're probably more interested in this fixed regex. – Denys Séguret Aug 25 '15 at 19:52

You're overthinking it. Keep it simple!

body = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
//                          ^^^^

No need for the lookarounds: just require a non-period character right before the closing tag, after a lazy run of zero or more characters.

Note: I would use Denys' solution (which I +1'd) since regex isn't a good idea for HTML.


Update:

Check out this expression:

<p>((?:.(?!\.))*?)<\/p>

This lazily repeats, zero or more times, a non-capturing group that matches one character not followed by a period. The one gap is that the first character itself is never checked for a period (the lookahead only inspects what comes after each matched character), but this can be fixed with a lookahead at the beginning:

<p>((?=[^.])(?:.(?!\.))*?)<\/p>
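
To see how the two expressions differ, here is a small sketch that runs both on text with every paragraph on a single line. The variable names `oneLine`, `simple`, and `updated` are illustrative, not from the original post.

```javascript
const oneLine = "<p>Sentence.</p><p>Title</p><p>Sentence.</p>";

// Simple expression: .*? is free to run across </p><p> until it finds
// a non-period character directly before some later </p>.
const simple = oneLine.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
console.log(simple);
// <h3>Sentence.</p><p>Title</h3><p>Sentence.</p>

// Updated expression: no matched character may be followed by a period,
// so a match cannot pass through "Sentence." into the next paragraph.
const updated = oneLine.replace(
  /<p>((?=[^.])(?:.(?!\.))*?)<\/p>/gi,
  "<h3>$1</h3>"
);
console.log(updated);
// <p>Sentence.</p><h3>Title</h3><p>Sentence.</p>
```

One trade-off of the updated expression: it rejects any paragraph containing a period anywhere, so a title such as `1.2 million people like regex` stays a `<p>`.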

Sam
  • Care to comment -1? I realize regular expressions aren't the best way to solve this, but I still think OP should know how to fix their expression. – Sam Aug 24 '15 at 16:41
  • I agree that this doesn't seem to warrant a -1 (a small warning about HTML and regexes would make it better, though) – Denys Séguret Aug 24 '15 at 16:46
  • Thanks Sam. However, I think this would replace `<p>Sentence.</p><p>Title</p><p>Sentence.</p>` with this `<h3>Sentence.</p><p>Title</h3><p>Sentence.</p>`. This may be my fault as I have presented the text in multi-line format for clarity, and in that case your code seems to work. In reality all the text is on one line.
    – DrBloke Aug 25 '15 at 10:43