Javascript Regex: Match text NOT part of a HTML tag

Question

I'm looking for a regex that runs in node.js (so no jQuery DOM handling etc., because the tags can be nested differently) and matches all the text that is NOT an HTML tag or part of one, as separate groups.

E.g. I'd like to match "5", "ELT.", "SPR", " ", "pio", "Unterricht", " ", "&nbsp;" and "pio" from this string:

<tr class='list even'>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">5</span>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">ELT.</span></b>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">SPR</span></b>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <strike><span style="color: #010101">pio</span></strike>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">Unterricht</span>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">pio</span></b>
    </td>
</tr>

I can guarantee that there will be no ">" characters inside the tags.

The solution I found was (?<=^|>)[^><]+?(?=<|$), but that doesn't work in node.js (probably because of the lookaheads? It says "Invalid group").

Any suggestions? (And yes, I really think a regex is the right way to go: the HTML may be nested in other ways, but because it's a table the content always appears in the same order.)

I love linking to this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — NimChimpsky, Sep 24 '11 at 17:05
Is this what you are looking for? http://stackoverflow.com/questions/822452/strip-html-from-text-javascript — Rusty Fausak, Sep 24 '11 at 17:05
You cannot use regular expressions to parse HTML (this is the point of the link @NimChimpsky gave you), because HTML is not a regular language. Any attempt to use regular expressions, solely, to parse HTML ***will fail***. You have no choice but to actually *parse* the HTML. — T.J. Crowder, Sep 24 '11 at 17:08
@rfausak: No, because the OP has said clearly they're not running in a browser. — T.J. Crowder, Sep 24 '11 at 17:08
If you want to match something based on the context around and don't have lookarounds available, then ... no. — Howard, Sep 24 '11 at 17:09
I think the second answer down on the question I linked has such a solution. — Rusty Fausak, Sep 24 '11 at 17:09
@NimChimpsky well, I see your point, but parsing the html and looking for all leaves in the tree simply seems like an overkill. — iStefo, Sep 24 '11 at 17:37
@rfausak I'm afraid I can not rely on browsers in this case, but your link contains a solution, thanks — iStefo, Sep 24 '11 at 17:39
_"Any attempt to use regular expressions, solely, to parse HTML will fail."_ -- no, it won't. It will probably succeed, maybe even for over 90% of the cases, but that's the problem: **it _appears_ to be successful** so the developer doesn't realise they have bugs with certain input. // Of course, this is referring to uncontrolled/unknown input. If you know exactly what you have and are _matching text_ (not parsing HTML) then, in certain situations, it can be possible to craft a working regex (though that still doesn't mean it's necessarily easy or the best solution). — Peter Boughton, Sep 24 '11 at 17:45
As for the question itself... iStefo, lookaheads work fine in JS - it's the lookbehind that is the problem. So you could match on `(?:^|>)[^<>]+(?=<|$)` and then for each item replace `^>` with empty string - of course, remembering that regex is _not_ guaranteed to be the best way to do this, and with unpredictable input you almost certainly will get incorrect matches at some point. — Peter Boughton, Sep 24 '11 at 17:56
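For illustration, here is a minimal sketch of the lookbehind-free idea from the comment above: match the text together with the ">" that precedes it, then strip that ">" from each match. The function name `textOutsideTags` and the extra trimming/filtering step are assumptions added for this example and are not part of the original comment.

```javascript
// Sketch only: relies on the guarantee that no ">" appears inside a tag.
function textOutsideTags(html) {
  // (?:^|>) consumes the ">" that a lookbehind would merely have asserted.
  const matches = html.match(/(?:^|>)[^<>]+(?=<|$)/g) || [];
  return matches
    .map((m) => m.replace(/^>/, "")) // drop the consumed ">"
    .map((m) => m.trim())            // assumption: surrounding whitespace is not wanted
    .filter((m) => m.length > 0);    // discard pure indentation between tags
}

const sample =
  '<td class="list"><b><span style="color: #010101">ELT.</span></b></td><td>&nbsp;</td>';

console.log(textOutsideTags(sample)); // [ 'ELT.', '&nbsp;' ]
```

As the comment says, this is only reasonable for input whose shape you control; with arbitrary HTML it will eventually produce wrong matches.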

Narendra Yadala · Accepted Answer · 2011-09-24T17:37:51.013

Try 'yourhtml'.replace(/(<[^>]*>)/g,' ')

'<tr class="list even"><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">5</span></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">ELT.</span></b></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">SPR</span></b></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><strike><span style="color: #010101">pio</span></strike></td><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">Unterricht</span></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">pio</span></b></td></tr>'.replace(/(<[^>]*>)/g,' ')

This gives you the text you want as a space-delimited string, which you can then split on spaces.
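A short sketch of the full replace-then-split round trip might look like the following. Splitting on a space would also break up values that themselves contain spaces, so this version substitutes a delimiter that is assumed never to occur in the data. The names `DELIM` and `cellTexts` and the choice of "\u0000" are hypothetical, picked only for this example.

```javascript
// Hypothetical delimiter: any character that cannot occur in the cell text.
const DELIM = "\u0000";

function cellTexts(html) {
  return html
    .replace(/<[^>]*>/g, DELIM) // every tag becomes one delimiter
    .split(DELIM)               // cut the string at the former tag positions
    .map((s) => s.trim())       // strip indentation and newlines around values
    .filter((s) => s !== "");   // adjacent tags leave empty strings behind
}

const row =
  '<tr><td class="list"><span style="color: #010101">5</span></td>' +
  '<td class="list"><b><span style="color: #010101">ELT.</span></b></td>' +
  '<td class="list">&nbsp;</td></tr>';

console.log(cellTexts(row)); // [ '5', 'ELT.', '&nbsp;' ]
```

Entities such as `&nbsp;` are returned as-is; decoding them would be a separate step.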

Yepp, that's what I'll do, thx. But I will use a nice UTF-8 Char or sth. because my Values may contain whitespaces as well I think... — iStefo, Sep 24 '11 at 17:41

score 2 · Answer 2 · answered Sep 24 '11 at 17:39


Maybe you can split directly using the tags themselves:

html.split(/<.*?>/)

Afterwards you have to remove the empty (or whitespace-only) strings from the result.
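As a rough sketch, the split and the clean-up can be combined as below. The function name `splitOnTags` is made up for this example, and filtering on the trimmed value is an added assumption so that indentation between tags is dropped along with truly empty strings. Note that `.` does not match newlines, so the pattern relies on each tag sitting on a single line.

```javascript
function splitOnTags(html) {
  return html
    .split(/<.*?>/)                         // the tags themselves act as separators
    .filter((part) => part.trim() !== "");  // drop empty and whitespace-only pieces
}

const fragment =
  "<td>\n    <strike><span>pio</span></strike>\n</td><td>Unterricht</td>";

console.log(splitOnTags(fragment)); // [ 'pio', 'Unterricht' ]
```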


Howard
