1

I am trying to extract the text between paragraph tags using a RegExp in JavaScript, but it doesn't work.

My pattern:

<p>(.*?)</p>

Subject:

<p> My content. </p> <img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTJ9ylGJ4SDyl49VGh9Q9an2vruuMip-VIIEG38DgGM3GvxEi_H"> <p> Second sentence. </p>

Result :

My content

What I want:

My content. Second sentence.
tonymx227
  • 3
    [Don't parse HTML with RegEx](http://stackoverflow.com/a/1732454/361684) – gilly3 Feb 19 '13 at 23:51
  • 1
    You can get the body of `<p>` tags just fine with regex (despite the warnings against parsing generally with it), but if you're using JavaScript there's no need to since you have `document.getElementsByTagName("p")`.

    – Reinstate Monica -- notmaynard Feb 19 '13 at 23:58
  • @iamnotmaynard - `document.getElementsByTagName()` is a DOM method. It is only available to JavaScript because the browser provides it. With node.js, there is no browser, and node.js does not natively parse HTML into a DOM. You can't assume that, just because you are using the JavaScript language, a browser DOM is available. A DOM can be made available to node.js if such a package is installed, such as [jsdom](https://npmjs.org/package/jsdom). – gilly3 Feb 20 '13 at 00:06
  • @gilly3 Ah, I see. Was not aware of that. – Reinstate Monica -- notmaynard Feb 20 '13 at 00:07
  • @gilly3, hoh no... Not that easy generic answer again -_-. Using regex for what he wants is perfectly fine. – Jean-Philippe Leclerc Feb 20 '13 at 00:45
  • @Jean-PhilippeLeclerc - What about this valid html: `<p>Paragraph1<p>Paragraph2`

    – gilly3 Feb 20 '13 at 01:37

2 Answers

4

There is no "capture all group matches" (analogous to PHP's `preg_match_all`) in JavaScript, but you can cheat by using `.replace`:

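var matches = [];
// html holds the markup to search; .replace is used only for its callback,
// so its return value is discarded.
html.replace(/<p>(.*?)<\/p>/g, function (match, content) {
    // match is the entire match; content is the first capture group
    matches.push(content);
});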
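For the input in the question, the collected captures can then be trimmed and joined into the single string the asker wants. This is only a minimal sketch, assuming the markup contains plain `<p>` tags without attributes and with explicit closing tags. The helper name `extractParagraphText` and the `example.com` image URL are made up for illustration.

```javascript
// Hypothetical helper: collects the text of every <p>...</p> pair and joins it.
// Assumes plain <p> tags without attributes and with explicit closing tags.
function extractParagraphText(html) {
    var matches = [];
    // .replace is used only for its per-match callback; its return value is ignored.
    html.replace(/<p>(.*?)<\/p>/g, function (whole, inner) {
        matches.push(inner.trim());
    });
    return matches.join(' ');
}

// Placeholder input modeled on the subject string from the question.
var subject = '<p> My content. </p> <img src="https://example.com/image.png"> <p> Second sentence. </p>';
console.log(extractParagraphText(subject)); // My content. Second sentence.
```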
Explosion Pills
  • OK, so how can I do this with Jade and Node.js to extract the text between `<p>` and `</p>`?
    – tonymx227 Feb 19 '13 at 23:57
  • @tonymx227 I don't really know what you mean .. that code is just raw JavaScript, so you should be able to use it with any JS interpreter – Explosion Pills Feb 19 '13 at 23:58
  • Yes, I know. But my controller sends all the posts to my Jade view (for example), and in the view I try to get the content of a post without the tags: `${posts.content.match('/<p>(.*?)<\/p>/g')}`, but it doesn't work...

    – tonymx227 Feb 20 '13 at 00:05
  • I don't know how to use Jade views, so I wouldn't really be able to help you there. I said to use `.replace`, not `match`, though – Explosion Pills Feb 20 '13 at 00:07
  • I asked a new question because it's not the same subject. But thank you anyway. – tonymx227 Feb 20 '13 at 00:19
  • @kilianc you could use `[\w\W]` instead of `.` for that or the `s` / dotall flag – Explosion Pills Dec 02 '15 at 14:05
1

To get more than one match of a pattern, add the global flag `g`.
With the global flag, the `match` method returns only the full matches and discards the capture groups `()`, whereas the `exec` method returns them on each call. See MDN exec.

var m,
    rex = /<p>(.*?)<\/p>/g,
    str = '<p> My content. </p> <img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTJ9ylGJ4SDyl49VGh9Q9an2vruuMip-VIIEG38DgGM3GvxEi_H"> <p> Second sentence. </p>';

while ( ( m = rex.exec( str ) ) != null ) {
    console.log( m[1] );
}

//  My content. 
//  Second sentence. 
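The difference between `match` and `exec` described above can be seen by running the same pattern through `match`. As an alternative to the `exec` loop, the sketch also uses `String.prototype.matchAll`, which assumes a runtime with ES2020 support (for example a current Node.js release). The shortened input and the variable names are illustrative only.

```javascript
var sample = '<p> My content. </p> <p> Second sentence. </p>';
var pattern = /<p>(.*?)<\/p>/g;

// With the g flag, match returns only the full matches; the groups are dropped.
var fullMatches = sample.match(pattern);
console.log(fullMatches); // ['<p> My content. </p>', '<p> Second sentence. </p>']

// matchAll yields one match object per hit, each carrying its capture groups.
var groups = Array.from(sample.matchAll(pattern), function (m) {
    return m[1];
});
console.log(groups); // [' My content. ', ' Second sentence. ']
```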

If there may be newlines between the paragraph tags, use `[\s\S]` (any whitespace or non-whitespace character, so effectively any character) instead of `.`, which does not match line terminators.
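A short sketch of the newline case, using a made-up two-line paragraph. The last variant uses the `s` (dotAll) flag and so assumes an ES2018-capable runtime; the variable names are illustrative.

```javascript
var multiline = '<p>First line\nsecond line</p>';

// . does not match line terminators, so the paragraph is missed entirely.
var dotResult = /<p>(.*?)<\/p>/.exec(multiline);
console.log(dotResult); // null

// [\s\S] matches any character, including \n.
var anyCharResult = /<p>([\s\S]*?)<\/p>/.exec(multiline);
console.log(JSON.stringify(anyCharResult[1])); // "First line\nsecond line"

// With the s flag, . also matches line terminators (ES2018 and later).
var dotAllResult = /<p>(.*?)<\/p>/s.exec(multiline);
console.log(dotAllResult[1] === anyCharResult[1]); // true
```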

Note that this kind of regex will fail on nested paragraphs as it will match up to the first closing tag.

MikeM
  • There's no such thing as "nested paragraphs". A `<p>` does not require a closing tag. A block element that occurs after an open `<p>` tag implies a closing `</p>` tag. Your regexp will treat multiple paragraphs without closing tags as one single paragraph.
    – gilly3 Mar 07 '13 at 23:38
  • @gilly3. XHTML requires the closing tag, and I think the OP makes it quite clear in his question that he is looking for the content between opening and closing p tags. It is pretty obvious my answer assumes the closing tags, and if there aren't any, _the OP's_ regex (not mine) won't match anyway. Nevertheless, I think your observation is worthwhile, so thank you. – MikeM Mar 07 '13 at 23:59
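The point about optional closing tags can be checked directly. A small sketch, with an input made up for illustration, showing how the pattern from the answer behaves when the first paragraph is not explicitly closed:

```javascript
var rexUnclosed = /<p>(.*?)<\/p>/g;
// Hypothetical input: the first <p> relies on the implied closing tag.
var unclosed = '<p>Paragraph1<p>Paragraph2</p>';

var found = [];
var hit;
while ((hit = rexUnclosed.exec(unclosed)) !== null) {
    found.push(hit[1]);
}

// One capture spanning both paragraphs, including the second opening tag.
console.log(found); // ['Paragraph1<p>Paragraph2']
```

If input like this is possible, an HTML parser is the safer tool, as the earlier comments suggest.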