2

I have the following regular expression:

(</?[a-z][a-z0-9]*[^<>]*>)

I have the following text:

<DIV><P class='abc'>Hello <B>Mister</B>! How are you >..< doing? </P>
<I>I'm good</I></DIV>

Now I'd like to split the text per tag:

<DIV>
<P class='abc'>
Hello 
<B>
Mister
</B>
! How are you >..< doing?

</P>
<I>
I'm good
</I>
</DIV>

How can I do this with a JavaScript regex?
I was able to get it to work, but had to start over because JavaScript doesn't support lookbehinds.

(basically split on html tags and keep the delimiters)

Edit:
My goal is to use HTML to store formatting. I want to feed the HTML above to a JavaScript object, which separates the formatting from the text and performs action A for formatting objects and action B for regular text.

I know it sounds a bit vague, but I don't want to reveal too much about the project.

Yvo
  • Out of curiosity, is there a reason you're trying to parse HTML using a regex? Unless you have a lot of control over the input, [you may have a few problems](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). – NT3RP Sep 05 '11 at 16:42
  • 2
    First of all, your HTML is not valid, which makes it much harder to parse: `>..<` should be encoded as `&gt;..&lt;`. Secondly, [Parsing HTML with regex summons tainted souls into the realm of the living](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – shesek Sep 05 '11 at 16:44

3 Answers

4

I actually agree with Omar on this issue, but I'll give you the regex anyway. :)

\<[^>]+?>|.+?(?=(?:<[^><]+?>|$))
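
A minimal sketch of how this pattern could be applied, assuming it is written as a regex literal with the global flag and used with `String.prototype.match`. The helper name `splitByTags` is made up for illustration.

```javascript
// The pattern from this answer as a literal, with the global flag so that
// String.prototype.match returns every piece instead of only the first one.
const TAG_OR_TEXT = /<[^>]+?>|.+?(?=(?:<[^><]+?>|$))/g;

// Hypothetical helper: returns an array of tags and text runs, in order.
function splitByTags(html) {
  return html.match(TAG_OR_TEXT) || [];
}

const sample =
  "<DIV><P class='abc'>Hello <B>Mister</B>! How are you >..< doing? </P>\n" +
  "<I>I'm good</I></DIV>";

const pieces = splitByTags(sample);
console.log(pieces);
```

On the sample text this yields twelve pieces, from `<DIV>` through `</DIV>`. The newline between `</P>` and `<I>` is skipped, because `.` does not match line terminators. Trailing text on a single line is returned as well, since the lookahead also accepts `$`.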
Paul Walls
  • Thanks, small question: if the html ends with a piece of text it doesn't work. How do I fix that? (now it has to end with a tag) – Yvo Sep 05 '11 at 16:54
  • What about if there is a ``? – 6502 Sep 05 '11 at 16:55
  • 1
    @6502 There are probably some edge cases that will break any regex. See shesek's link in the comment to the OP. :) – Paul Walls Sep 05 '11 at 17:20
  • This is a common misconception. While it's true that no regexp can parse an html document (or any hierarchical syntax) it doesn't mean a regexp cannot parse tags. – 6502 Sep 05 '11 at 17:34
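
For the narrower task of splitting on tags and keeping the delimiters, one option that needs no lookbehind is `String.prototype.split` with a capturing group: captured matches are included in the result array. The sketch below assumes the pattern from the question plus the `i` flag (the sample uses upper-case tags); `splitKeepingTags` is a hypothetical name.

```javascript
// The pattern from the question, with the "i" flag so upper-case tags match.
const TAG = /(<\/?[a-z][a-z0-9]*[^<>]*>)/i;

function splitKeepingTags(html) {
  // The capturing group makes split() include each matched tag in the result.
  // Empty strings appear between adjacent tags, so they are filtered out.
  return html.split(TAG).filter((piece) => piece !== "");
}

console.log(splitKeepingTags("<B>Mister</B>! How"));
// → [ '<B>', 'Mister', '</B>', '! How' ]
```

Whitespace-only pieces, such as a newline between two tags, are kept by this filter. A stray `<` that is not followed by a letter or `/` stays inside its text piece.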
1

This has been said countless times: regex is not the right tool for this. Regex works well on small, short, finite pieces of text, such as checking and validating user input.

I would suggest that you learn more about the browser DOM. Each tag is an object in the DOM that can be selected and referenced with JavaScript. You can work with your data that way.

Omar Abid
0

Cannot test right now, but what about

/(<\/?[a-zA-Z]+([^"]|"(\\.|[^"])*")*>)|([^<]|<[^a-zA-Z])*/
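
Since the pattern above is untested, here is a hedged variant that was traced against the sample text. As written, `[^"]` also matches `>`, so the tag branch can run on to the last `>` in the input, and `<[^a-zA-Z]` accepts `</`, so closing tags can be swallowed as text. The adjustments below are assumptions rather than the original pattern, and `tokenize` with its `type` labels is a made-up name.

```javascript
// Adjusted variant of the pattern above (an assumption, not the original):
//  - [^">] instead of [^"], so a tag ends at its first unquoted ">"
//  - <[^a-zA-Z\/] instead of <[^a-zA-Z], so "</B>" is not consumed as text
//  - "+" instead of "*", so the text branch never matches an empty string
const TOKEN =
  /(<\/?[a-zA-Z]+(?:[^">]|"(?:\\.|[^"])*")*>)|((?:[^<]|<[^a-zA-Z\/])+)/g;

// Hypothetical helper: labels each piece as a tag or as plain text.
function tokenize(html) {
  const tokens = [];
  for (const m of html.matchAll(TOKEN)) {
    tokens.push(
      m[1] !== undefined
        ? { type: "tag", value: m[1] }
        : { type: "text", value: m[2] }
    );
  }
  return tokens;
}

console.log(tokenize("<B>Mister</B>! How are you >..< doing?"));
```

A caller could then branch on `type` to run one action for formatting and another for regular text, as described in the question. Double-quoted attribute values may contain `>` here; single-quoted ones may not.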
6502