Regular expression to capture a tag

Question

I have the following HTML text, and in JavaScript I need to capture all the "p" tags that have the class "page-break" and then replace them with some other text.

I need to use a regular expression because this HTML text is going to be processed as text, not as a DOM structure.

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Praesent pellentesque tincidunt adipiscing</p>

<p class="page-break">break</p>

<p>Suspendisse a velit at diam facilisis
egestas sit amet a lectus.</p>

<p class="page-break">other</p>

<p>Donec tristique placerat massa vitae hendrerit. Maecenas nec
massa adipiscing sem venenatis vehicula. Suspendisse mattis pretium
libero quis dignissim. Nulla volutpat imperdiet vehicula. Donec ut
tristique neque.</p>

What prevents me from using a DOM parser is that I plan to insert an invalid HTML element. I plan to transform the previous HTML into the following, so I need to parse it first as text to skip HTML validation and then paste it like this:

 <div class="pag visible">
 <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Praesent pellentesque tincidunt adipiscing</p>
 </div>
 <div class="pag">   
    <p>Suspendisse a velit at diam facilisis
    egestas sit amet a lectus.</p>
 </div>
 <div class="pag">   
    <p>Donec tristique placerat massa vitae hendrerit. Maecenas nec
    massa adipiscing sem venenatis vehicula. Suspendisse mattis pretium
    libero quis dignissim. Nulla volutpat imperdiet vehicula. Donec ut
    tristique neque.</p>
 </div>

As you can see, every ".page-break" paragraph is replaced by a page boundary: the div before it is closed and a new div is opened.

why do you want a regex? javascript has a pretty decent dom parser. — Daniel A. White, Aug 07 '12 at 15:52
I need to use a regular expression because that text is going to be processed as text, not as a DOM structure — eli.rodriguez, Aug 07 '12 at 15:56
Trying to find the thread here, but parsing HTML with reg exps is nearly impossible to do correctly. If you can guarantee that the structure will be consistent it is a little easier. — epascarello, Aug 07 '12 at 15:59
@eli.rodriguez What's preventing you from using the DOM to get the element, then processing the innerHTML "like a text"? — Eric Finn, Aug 07 '12 at 16:01
@epascarello You mean this? http://stackoverflow.com/a/1732454/407071 — Eric Finn, Aug 07 '12 at 16:02
I updated the description to explain the situation and the goal better — eli.rodriguez, Aug 07 '12 at 16:42

score 4 · Answer 1 · edited May 23 '17 at 11:48

Don't use a regexp to parse HTML. Use the DOM instead. If you have a plain string, create a detached element (for example a div; a DocumentFragment has no .innerHTML of its own) and assign the string to its .innerHTML to get a DOM.

Find your p tags with getElementsByTagName, check their .className and act accordingly.
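A minimal sketch of that approach follows. The helper name replacePageBreaks and its return value (the number of paragraphs changed) are illustrative choices that do not come from the answer, and the usage part at the bottom only runs where a browser `document` exists.

```javascript
// Hypothetical helper: replaces the text of every <p> whose class list
// contains "page-break". `container` is any element-like object that
// offers getElementsByTagName (an assumption of this sketch).
function replacePageBreaks(container, replacementText) {
  // Copy the collection so the loop iterates over a stable list.
  const paragraphs = Array.from(container.getElementsByTagName("p"));
  let replaced = 0;
  for (const p of paragraphs) {
    // className may hold several space-separated classes.
    const classes = p.className.split(/\s+/);
    if (classes.includes("page-break")) {
      p.textContent = replacementText;
      replaced += 1;
    }
  }
  return replaced;
}

// Browser-only usage: build a detached DOM from a plain string.
if (typeof document !== "undefined") {
  const container = document.createElement("div");
  container.innerHTML = '<p>one</p><p class="page-break">break</p><p>two</p>';
  replacePageBreaks(container, "MY TEXT HERE");
  console.log(container.innerHTML);
}
```

Because the container is never attached to the page, the string is parsed without being rendered, and container.innerHTML gives the text back once the paragraphs have been changed.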

answered Aug 07 '12 at 15:58

Oleg V. Volkov

I need to use a regular expression because that text is going to be processed as text, not as a DOM structure – eli.rodriguez Aug 07 '12 at 15:59

@eli.rodriguez You already have a string-to-DOM parser in the browser, in the form of innerHTML. – Oleg V. Volkov Aug 07 '12 at 15:59
I updated the description to explain the situation and the goal better – eli.rodriguez Aug 07 '12 at 16:36

score 1 · Accepted Answer · answered Aug 07 '12 at 17:14

// your content
var content = '<p>Lorem ips...';
// match each <p class="page-break">...</p>; the lazy [\s\S]*? also spans
// line breaks and stops at the first closing </p>
var regex = /<p\s+class\s*=\s*"page-break"\s*>[\s\S]*?<\/p>/g;
// replace() returns a new string, so keep the result:
content = content.replace(regex, "<p>MY TEXT HERE</p>");
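The same kind of pattern can also produce the page layout described in the question, by splitting the text at each marker instead of replacing it. The following is only a sketch: the function name paginate is hypothetical, and it assumes every marker is written exactly as `<p class="page-break">...</p>`, with no other attributes and no nested paragraphs.

```javascript
// Hypothetical helper: cuts HTML text into pages at each
// <p class="page-break">...</p> marker and wraps every page in a div.
function paginate(html) {
  // Assumption: the marker has only the class attribute, in double quotes.
  const pageBreak = /<p\s+class\s*=\s*"page-break"\s*>[\s\S]*?<\/p>/g;
  return html
    .split(pageBreak)                      // the markers themselves are dropped
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)   // skip empty pages
    .map((chunk, index) => {
      // Only the first page gets the extra "visible" class.
      const cls = index === 0 ? "pag visible" : "pag";
      return '<div class="' + cls + '">\n' + chunk + "\n</div>";
    })
    .join("\n");
}

const sample =
  '<p>first</p>\n\n<p class="page-break">break</p>\n\n<p>second</p>';
console.log(paginate(sample));
```

Splitting keeps the work purely textual, as the question asks, but it will break as soon as the marker's markup varies (single quotes, extra classes, extra attributes).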


answered Aug 07 '12 at 17:14

ted

score 0 · Answer 3 · answered Aug 07 '12 at 15:56

Have you thought of using jQuery?

$('p.page-break').html('replacement value goes here');

This will replace the contents of every <p class="page-break"> with "replacement value goes here". (The class test has to go in the selector: .hasClass() returns a boolean, so it cannot be chained.)

Or $('p.page-break').remove(); will remove those <p> elements entirely.

answered Aug 07 '12 at 15:56

Rich Andrews

I need to use a regular expression because that text is going to be processed as text, not as a DOM structure – eli.rodriguez Aug 07 '12 at 15:57

Vaman Kulkarni · Answer 4 · 2012-08-07T16:13:08.840

It is not advisable to parse HTML with regex. You can use XPath to fetch all the <p> elements that meet the criteria, then go over the returned list and update the textContent of each <p>, as shown in the snippet below.

// A snapshot result stays valid while the document is being changed;
// an iterator result would be invalidated by the first change.
var pList = document.evaluate("//p[@class='page-break']", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i = 0; i < pList.snapshotLength; i++) {
    pList.snapshotItem(i).textContent = "New Text";
}

Explanation

//p[@class='page-break'] fetches all the <p> elements whose class attribute is exactly 'page-break'. The document.evaluate function returns an object of type XPathResult. With a snapshot result, snapshotLength gives the number of matches and snapshotItem(i) returns each element. You can set the new text through the textContent property.

I updated the description to explain the situation and the goal better — eli.rodriguez, Aug 07 '12 at 16:36
