Extract text between paragraph tag using RegEx

Question

I'm trying to extract the text between paragraph tags using a RegExp in JavaScript, but it doesn't work...

My pattern:

<p>(.*?)</p>

Subject:

<p> My content. </p> <img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTJ9ylGJ4SDyl49VGh9Q9an2vruuMip-VIIEG38DgGM3GvxEi_H"> <p> Second sentence. </p>

Result :

My content

What I want:

My content. Second sentence.

[Don't parse HTML with RegEx](http://stackoverflow.com/a/1732454/361684) — gilly3, Feb 19 '13 at 23:51
You can get the body of `<p>` tags just fine with regex (despite the warnings against parsing generally with it), but if you're using JavaScript there's no need to since you have `document.getElementsByTagName("p")`. — Reinstate Monica -- notmaynard, Feb 19 '13 at 23:58
@iamnotmaynard - `document.getElementsByTagName()` is a DOM method. It is only available to JavaScript because the browser provides it. With node.js, there is no browser, and node.js does not natively parse HTML into a DOM. You can't assume that, just because you are using the JavaScript language, a browser DOM is available. A DOM can be made available to node.js if such a package is installed, such as [jsdom](https://npmjs.org/package/jsdom). — gilly3, Feb 20 '13 at 00:06
@gilly3, hoh no... Not that easy generic answer again -_-. Using regex for what he wants is perfectly fine. — Jean-Philippe Leclerc, Feb 20 '13 at 00:45
@Jean-PhilippeLeclerc - What about this valid html: `<p>Paragraph1<p>Paragraph2` — gilly3, Feb 20 '13 at 01:37

score 4 · Accepted Answer · answered Feb 19 '13 at 23:52

There is no "capture all group matches" (analogous to PHP's preg_match_all) in JavaScript, but you can cheat by using .replace:

var matches = [];
html.replace(/<p>(.*?)<\/p>/g, function () {
    //arguments[0] is the entire match
    matches.push(arguments[1]);
});
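
Engines that support ES2020 also provide `String.prototype.matchAll`, which was not available when this answer was written. The following is a minimal sketch under that assumption, not a replacement for the `.replace` approach above. The helper name `extractParagraphText` and the shortened `src` value are hypothetical; the sample string otherwise follows the question.

```javascript
// Hypothetical helper: returns the trimmed text of every <p>...</p> pair.
function extractParagraphText(html) {
    var pattern = /<p>(.*?)<\/p>/g;
    // matchAll requires the g flag and yields one match object per hit;
    // index 1 of each match object is the first capture group.
    return Array.from(html.matchAll(pattern), function (m) {
        return m[1].trim();
    });
}

// Sample input modelled on the question (the image URL is shortened here).
var sample = '<p> My content. </p> <img src="photo.png"> <p> Second sentence. </p>';
var paragraphs = extractParagraphText(sample);

console.log(paragraphs.join(' ')); // My content. Second sentence.
```

Joining the trimmed captures with a space gives the single string the question asks for.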

Explosion Pills

OK, so how can I use Jade and NodeJS to extract the text between `<p>` and `</p>`? – tonymx227 Feb 19 '13 at 23:57
@tonymx227 I don't really know what you mean .. that code is just raw JavaScript, so you should be able to use it with any JS interpreter – Explosion Pills Feb 19 '13 at 23:58
Yes, I know. But my controller sends all the posts to my Jade view (for example), and in the view I try to get the content of a post without the tags: `${posts.content.match('/<p>(.*?)<\/p>/g')}`, but it doesn't work... – tonymx227 Feb 20 '13 at 00:05
I don't know how to use Jade views, so I wouldn't really be able to help you there. I said to use `.replace`, not `match`, though – Explosion Pills Feb 20 '13 at 00:07
I asked a new question because it's not the same subject. But thank you anyway. – tonymx227 Feb 20 '13 at 00:19
@kilianc you could use `[\w\W]` instead of `.` for that or the `s` / dotall flag – Explosion Pills Dec 02 '15 at 14:05

score 1 · Answer 2 · answered Feb 20 '13 at 09:57

To get more than one match of a pattern, add the global flag `g`.
With `g`, the `match` method returns only the full matches and drops the capture groups `()`, but the `exec` method keeps them. See MDN exec.

var m,
    rex = /<p>(.*?)<\/p>/g,
    str = '<p> My content. </p> <img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTJ9ylGJ4SDyl49VGh9Q9an2vruuMip-VIIEG38DgGM3GvxEi_H"> <p> Second sentence. </p>';

while ( ( m = rex.exec( str ) ) != null ) {
    console.log( m[1] );
}

//  My content. 
//  Second sentence.

If there may be newlines inside a paragraph, use `[\s\S]`, which matches any whitespace or non-whitespace character, instead of `.`.
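
As a rough illustration of the newline case, the sketch below runs the same `exec` loop over a paragraph that spans two lines. The helper name `captures` and the input string are made up for this example; the `s` (dotAll) flag assumes an engine with ES2018 support.

```javascript
// Hypothetical helper: collect capture group 1 of every match with an exec loop.
function captures(rex, str) {
    var m, out = [];
    while ((m = rex.exec(str)) !== null) {
        out.push(m[1]);
    }
    return out;
}

// Made-up input: the first paragraph contains a newline.
var multiline = '<p>First\nline</p>\n<p>Second</p>';

// `.` does not match \n, so the first paragraph is missed.
var dotOnly = captures(/<p>(.*?)<\/p>/g, multiline);

// [\s\S] matches any character, newlines included.
var anyChar = captures(/<p>([\s\S]*?)<\/p>/g, multiline);

// The s (dotAll) flag makes `.` match newlines as well.
var dotAll = captures(/<p>(.*?)<\/p>/gs, multiline);

console.log(dotOnly.length); // 1
console.log(anyChar.length); // 2
console.log(dotAll.length);  // 2
```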

Note that this kind of regex will fail on nested paragraphs as it will match up to the first closing tag.

MikeM

There's no such thing as "nested paragraphs". A `<p>` does not require a closing tag. A block element that occurs after an open `<p>` tag implies a closing `</p>` tag. Your regexp will treat multiple paragraphs without closing tags as one single paragraph. – gilly3 Mar 07 '13 at 23:38
@gilly3. XHTML requires the closing tag and I think the OP makes it quite clear in his question he is looking for the content between opening and closing p tags. It is pretty obvious my answer assumes the closing tags and if there isn't any _the OP's_ regex (not mine) won't match anyway. Nevertheless, I think your observation is worthwhile, so thank you. – MikeM Mar 07 '13 at 23:59
