Collect first 3 paragraphs in html using javascript

Question

Assume an article generated by Markdown that has 1 to N paragraphs in it. My brain is a bit fried tonight; all I could come up with was:

var chunks = s.split('</p>');
if (chunks.length > 3)
{
    // split() returns a zero-based array, so the first three paragraphs are chunks 0..2
    s = chunks[0]+'</p>'+chunks[1]+'</p>'+chunks[2]+'</p>';
}

Is there a more sane way to collect the first three paragraphs into a string? The markdown processor guarantees the paragraphs should be legal HTML. But I'm sure there must be a more clever regex solution. Also this won't guarantee three paragraphs if there is something else like a but that's OK.

If the HTML is in good XHTML format, why not use XPath? – nomistic Apr 17 '15 at 01:54

score 3 · Rick Hitchcock · Accepted Answer · 2015-04-17T13:02:29.567

Something like this?

var s= '<p>Paragraph 1</p><p>Paragraph <em>2</em></p><p>Paragraph 3</p><p>Paragraph 4</p><p>Paragraph 5</p>';

s= (s.split('</p>')
    .splice(0,3)
    .join('</p>') +
    '</p>'
   ).replace(/\<\/p> *\<\/p>/g,'</p>');

console.log(s);

edited Apr 17 '15 at 13:02

answered Apr 17 '15 at 01:13

Rick Hitchcock


I like this one, though not sure what splice does if there are only 2. I could always split it into two lines and check beforehand. – ahwulf Apr 17 '15 at 12:36
Ah, excellent point. It would create an extra closing `</p>`. Now updated to deal with that condition. – Rick Hitchcock Apr 17 '15 at 13:03
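
For what it's worth, the same split-based idea can be wrapped in a small function that handles the "fewer than three" case without the cleanup regex. This is only a sketch: `firstParagraphs` is a made-up name, and it assumes every paragraph ends with a literal `</p>`.

```javascript
// Hypothetical helper (name and signature are not from the answer above).
// Assumes every paragraph ends with a literal '</p>'.
function firstParagraphs(html, n) {
  var chunks = html.split('</p>');
  // The last chunk is whatever follows the final '</p>' (often an empty
  // string), so it is never a complete paragraph: drop it.
  var complete = chunks.slice(0, -1);
  return complete
    .slice(0, n)
    .map(function (chunk) { return chunk + '</p>'; })
    .join('');
}

var sample = '<p>Paragraph 1</p><p>Paragraph <em>2</em></p><p>Paragraph 3</p><p>Paragraph 4</p>';
console.log(firstParagraphs(sample, 3));
console.log(firstParagraphs('<p>Only one</p>', 3));
```

Since it is plain string handling, it should run the same on the server as in a browser.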

score 1 · Answer 2 · edited May 23 '17 at 10:10

I'd use something designed for handling the DOM, say jQuery:

var arrP = $("body p").slice(0,3);
var strP = "";
for(var i = 0; i < arrP.length; i++)  
{
  strP += arrP[i].outerHTML;
}
console.log(strP);

//Or Taking the article in as a string
var strArticle = "<p>Parra <em>1</em></p><p>Parra <strong>2</strong></p><p>Parra 3</p><p>Parra 4</p>";
var divArticle = document.createElement('div');
divArticle.innerHTML = strArticle;

arrP = $(divArticle).find("p").slice(0,3);
strP = "";
for(var i = 0; i < arrP.length; i++)  
{
  strP += arrP[i].outerHTML;
}
console.log(strP);

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<p>Parra <em>1</em></p>
<p>Parra <strong>2</strong></p>
<p>Parra 3</p>
<p>Parra 4</p>
<div id="target"></div>

I wouldn't pull in jQuery just for this, but if you're already using it or looking for an excuse to use it, this is an option. Otherwise go for Rick's answer. Regex is only appropriate for parsing HTML given very tight control over the input. Some would say it should never be used.

Or vanilla JavaScript

var arrP = document.body.getElementsByTagName("p");
var strP = "";
// Stop early if the page has fewer than three paragraphs
for(var i = 0; i < Math.min(3, arrP.length); i++)
  {
    strP += arrP[i].outerHTML;
  }

console.log(strP);

//Or Taking Article body as a string 
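var strArticle = "<p>Parra <em>1</em></p><p>Parra <strong>2</strong></p><p>Parra 3</p><p>Parra 4</p>";
var divArticle = document.createElement('div');
divArticle.innerHTML = strArticle; // parse the string inside the detached div
arrP = divArticle.getElementsByTagName("p"); // search the div, not the page
strP = "";
// Stop early if the article has fewer than three paragraphs
for(var i = 0; i < Math.min(3, arrP.length); i++)
  {
    strP += arrP[i].outerHTML;
  }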

console.log(strP);

<p>Parra <em>1</em></p>
<p>Parra <strong>2</strong></p>
<p>Parra 3</p>
<p>Parra 4</p>
<div id="target"></div>

I should have mentioned this is server side, not client. – ahwulf Apr 17 '15 at 12:34

score 0 · Answer 3 · answered Apr 17 '15 at 01:51

There is a one line regular expression, of course, but it is pretty hard to read.

var s= '<p>Paragraph 1</p><p>Paragraph <em>2</em></p><p>Paragraph 3</p><p>Paragraph 4</p><p>Paragraph 5</p>';

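var regex = /(?:\<p\>.*?\<\/p\>){3}/;
var match = regex.exec(s);
// exec() returns an array (or null if nothing matched); the matched text is element 0
s = match ? match[0] : s;
console.log(s);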

The regular expression matches some non-capturing group exactly three times. Digging into the non-capturing group, we see several characters have to be escaped and that we need to use a lazy quantifier. I'd take your way over the clever regex any day.

Steve Clanton


You need to be very careful with `.*` as it will match *everything*. Would be better to check for something like this for the close paragraph tag `^[<]\/p\>` – Kyle Falconer Apr 17 '15 at 02:22
I have been very careful. The lazy quantifier prevents matching everything. While the greedy .* will match everything, the .*? matches the minimum possible. – Steve Clanton Apr 17 '15 at 02:28
This works very well for the given string, but it will fail if there is white-space between the paragraphs, or if there are fewer than 3 paragraphs. – Rick Hitchcock Apr 17 '15 at 13:12
Maybe I am misinterpreting what was intended in the sentence with the phrase "guarantee three paragraphs," but I don't think it matters. If a reader is okay with the escaping, non-capturing groups, and lazy vs. greedy, changing to a {1,3} or adding a \s* is trivial. If not, the point that the "clever" regular expression is not the right direction still comes through. – Steve Clanton Apr 17 '15 at 21:46
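
For anyone who wants to see the `{1,3}` and `\s*` adjustments from the comments spelled out, here is a rough sketch. The function name `firstParagraphsRegex` is hypothetical, and it assumes plain `<p>` tags with no attributes; `[\s\S]` is swapped in for `.` so a paragraph may contain line breaks.

```javascript
// Hypothetical helper, not part of the original answer.
// Assumes paragraphs open with a bare '<p>' (no attributes).
function firstParagraphsRegex(html, max) {
  // \s*       allows white-space between paragraphs
  // [\s\S]*?  lazily matches any characters, including newlines
  // {1,max}   accepts fewer than `max` paragraphs
  var re = new RegExp('(?:\\s*<p>[\\s\\S]*?<\\/p>){1,' + max + '}');
  var match = re.exec(html);
  return match ? match[0] : '';
}

console.log(firstParagraphsRegex('<p>One</p>\n<p>Two</p>', 3));
console.log(firstParagraphsRegex('<p>1</p><p>2</p><p>3</p><p>4</p>', 3));
```

It only collects consecutive paragraphs, so a heading or other element between two paragraphs ends the match early.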

score 0 · Answer 4 · answered Apr 17 '15 at 02:25

You could get the paragraphs and just concatenate them until you reach three.

var pars = '';

//Get the p tags, go through some of them. Use your favorite library to do this. 
Array.prototype.some.call(document.querySelectorAll('p'), function(current, index) {  
  console.log("This should only go to 2", index);
  pars = pars + ['<p>', current.innerHTML, '</p>'].join(''); 
  return index >= 2; // index is zero-based, so this stops after the third paragraph
});

console.log(pars);

<p>This is one.</p>
<p>This is two.</p>
<p>This is five, er, three.</p>
<p>FOUR</p>
<p>FOUR PLUS ONE</p>
<p>FOUR PLUS TWO</p>
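
A variant of the approach above, as a rough sketch: slice the array-like first, then join each element's `outerHTML`. Unlike rebuilding the tag from `innerHTML`, `outerHTML` also keeps any attributes on the `<p>`. The name `joinFirst` and the stand-in objects are assumptions for illustration only.

```javascript
// Hypothetical helper: `elements` is any array-like of objects exposing
// outerHTML, e.g. the NodeList from document.querySelectorAll('p').
function joinFirst(elements, n) {
  return Array.prototype.slice.call(elements, 0, n)
    .map(function (el) { return el.outerHTML; })
    .join('');
}

// In a browser: joinFirst(document.querySelectorAll('p'), 3)
// Stand-in objects so the sketch also runs outside a browser:
var fakeParagraphs = [
  { outerHTML: '<p>This is one.</p>' },
  { outerHTML: '<p>This is two.</p>' },
  { outerHTML: '<p>This is five, er, three.</p>' },
  { outerHTML: '<p>FOUR</p>' }
];
console.log(joinFirst(fakeParagraphs, 3));
```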
