javascript regex paragraph not ending with full stop

Question

I have a document containing lots of paragraphs. Some of these are subheadings, which are identifiable because they do not end with a full stop, like this:

<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>

I want to put the titles into an h3 tag, but not the sentences, so I need to find and replace all paragraphs that do not end in a full stop. I need to do this with JavaScript. I have tried the following, but each attempt fails. In each case the text is first read into a variable called body.

body = body.replace(/<p>(.*?)(?!\.)<\/p>/gi, "<h3>$1</h3>");

That just turns every paragraph into a heading, so everything comes out bold.

This would work, I think:

body = body.replace(/<p>(.*?)(?<!\.)<\/p>/gi, "<h3>$1</h3>");

but JavaScript does not recognise negative lookbehind.

Any ideas how I do this?

Instead of trying to use regexp on HTML, which is always a slippery slope, I would pull out all the `p` elements, check their content, and add a class to the ones that don't end in a full stop. — , Aug 24 '15 at 16:46
Thanks Washington, I think you're right, except I don't think you need to escape inside a character class. So: `<p>([^.]*?)<\/p>` — DrBloke, Aug 25 '15 at 11:32
Note also, Washington, that this doesn't capture this title: `<p>1.2 million people like regex</p>`, but as I mention below, this doesn't really matter to me. — DrBloke, Aug 25 '15 at 11:57

Denys Séguret · Accepted Answer · 2015-08-26T19:36:35.133

3

You could do the replacement paragraph by paragraph, which would be cleaner than running a regex over the whole HTML:

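// querySelectorAll returns a static list, so replacing a <p> while iterating does not skip its neighbour
[].forEach.call(document.querySelectorAll('p'), function(p){
    // no '.', '?' or '!' at the end (ignoring trailing whitespace): treat the paragraph as a title
    if (!/[.?!]\s*$/.test(p.innerHTML)) p.outerHTML = "<h3>" + p.innerHTML + "</h3>";
});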

<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>You want to handle questions, right?</p>
<p>I'm sure you do!</p>
<p>This is a title containing 1.2 million</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a title</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>
<p>This is a sentence.</p>

This way there's no problem if your HTML evolves (will you really always have only P elements?).

edited Aug 26 '15 at 19:36

answered Aug 24 '15 at 16:39

Denys Séguret


Why do you need the `\s*` part of the expression? – Aug 24 '15 at 17:05
@sln It's very frequent to have unwanted spaces in HTML (often they're new lines at the start or end). – Denys Séguret Aug 24 '15 at 17:17
OK, but I just don't see what whitespace has to do with `[.?!]`. Who cares if there is optional space? – Aug 24 '15 at 19:27
Thanks Denys. I think this will be the way to go. I need to adapt it to work on my variable as I'm not working on an html document, but that should be easy enough. I will mark this as answered once I've confirmed I can do that. Note: your code won't capture this title: `<p>1.2 million people don't care</p>` even though it doesn't end with a full stop. But such an exception will be rare, so I can live with it. – DrBloke Aug 25 '15 at 10:47
For those who are interested, this is an adaptation of Denys' code that will work on the text string variable I describe in the question. Note: this does not work on the exception I define in the comment above. I'd be interested if anyone can think of a way to capture this exception too, but I don't require it. `body = body.replace(/<p>([^.?!]*?)<\/p>/gi, "<h3>$1</h3>");` – DrBloke Aug 25 '15 at 11:32
@DrBloke I realize after reading your comment that I forgot a $. You're probably more interested in this fixed regex. – Denys Séguret Aug 25 '15 at 19:52
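For the string-variable case raised in the comments above, one possible approach is to keep the same end-of-paragraph test but run it from a `replace` callback, so each paragraph is checked on its own. This is only a sketch: the helper name `promoteTitles` is made up for illustration, and it assumes plain, non-nested `<p>` elements without attributes.

```javascript
// Sketch: turn <p> paragraphs that do not end in '.', '?' or '!' into <h3> headings.
// `promoteTitles` is a hypothetical helper name, not part of any library.
function promoteTitles(body) {
  // [\s\S]*? lazily matches any characters (including newlines) up to the nearest </p>.
  return body.replace(/<p>([\s\S]*?)<\/p>/gi, function (match, inner) {
    // Same test as the DOM version: sentence-ending punctuation, then optional whitespace.
    var isSentence = /[.?!]\s*$/.test(inner);
    // Sentences are returned unchanged; everything else is treated as a title.
    return isSentence ? match : "<h3>" + inner + "</h3>";
  });
}

var body = "<p>1.2 million people like regex</p><p>This is a sentence.</p><p>This is a title</p>";
console.log(promoteTitles(body));
```

Because only the end of each paragraph is tested, a title such as "1.2 million people like regex" is still converted even though it contains a period.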

score 1 · Answer 2 · edited May 23 '17 at 12:23


You're overthinking it. Keep it simple!

body = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");
//                          ^^^^

No need for lookarounds: just require a non-period character at the end, right after the lazy `.*?`.

Note: I would use Denys' solution (which I +1'd), since regex isn't a good idea for HTML.
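A small comparison may help show why the lookahead attempt from the question converts everything while the character class does not. This is a sketch on a made-up two-paragraph input with one paragraph per line, as the question lays it out; the variable names are illustrative only.

```javascript
// Sample input with one paragraph per line, as laid out in the question.
var body = "<p>This is a title</p>\n<p>This is a sentence.</p>";

// Question's attempt: (?!\.) is tested right before "</p>". Wherever "</p>" can
// match, the next character is "<", so the lookahead always passes there.
var withLookahead = body.replace(/<p>(.*?)(?!\.)<\/p>/gi, "<h3>$1</h3>");

// Character-class version: the last character before "</p>" must not be a period.
var withCharClass = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");

console.log(withLookahead); // both paragraphs become <h3>
console.log(withCharClass); // only the title becomes <h3>
```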

Update:

Check out this expression:

<p>((?:.(?!\.))*?)<\/p>

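This lazily repeats a non-capturing group zero or more times. Each iteration consumes one character and uses a negative lookahead to require that the next character is not a period, so the expression only matches paragraphs with no period after the first character. The one gap is that the first character itself is never checked (the group starts with a plain dot), but this can be fixed with a lookahead at the beginning: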

<p>((?=[^.])(?:.(?!\.))*?)<\/p>
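When all the paragraphs sit on a single line, the lazy `.*?` in the first expression can run across a `</p><p>` boundary. The sketch below reproduces that and shows one possible variant that keeps each match inside a single paragraph. The variant is an assumption of this example rather than part of the original answer, and it still assumes plain `<p>` tags without attributes.

```javascript
// Single-line input: all paragraphs on one line.
var body = "<p>Sentence.</p><p>Title</p><p>Sentence.</p>";

// Lazy .*? may run across "</p><p>" to find a non-period character before a later "</p>".
var crossing = body.replace(/<p>(.*?[^.])<\/p>/gi, "<h3>$1</h3>");

// Assumed variant: (?:(?!<\/p>).)* consumes characters only while no "</p>" starts there,
// so a match stays inside one paragraph; [^.] then requires a non-period last character.
var contained = body.replace(/<p>((?:(?!<\/p>).)*[^.])<\/p>/gi, "<h3>$1</h3>");

console.log(crossing);  // the first two paragraphs are merged into one broken <h3>
console.log(contained); // only the title paragraph is converted
```

Unlike the lookahead-per-character expressions above, this variant only looks at the final character, so a title that contains a period in the middle is still converted.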

edited May 23 '17 at 12:23

Community


answered Aug 24 '15 at 16:36

Sam


Care to comment -1? I realize regular expressions aren't the best way to solve this, but I still think OP should know how to fix their expression. – Sam Aug 24 '15 at 16:41

I agree that this doesn't seem to warrant a -1 (a small warning about HTML and regexes would make it better, though) – Denys Séguret Aug 24 '15 at 16:46
Thanks Sam. However, I think this would replace `<p>Sentence.</p><p>Title</p><p>Sentence.</p>` with this `<h3>Sentence.</p><p>Title</h3><p>Sentence.</p>`. This may be my fault, as I have presented the text in multi-line format for clarity, and in that case your code seems to work. In reality all the text is on one line. – DrBloke Aug 25 '15 at 10:43
