
I would really like a regex that runs in node.js (so no jQuery DOM handling etc., because the tags can be nested differently) and that matches all the text that is NOT an HTML tag or part of one, into separate groups.

E.g. I'd like to match "5","ELT.","SPR"," ","pio","Unterricht"," ","&nbsp;" and "pio" from this string:

<tr class='list even'>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">5</span>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">ELT.</span></b>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">SPR</span></b>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <strike><span style="color: #010101">pio</span></strike>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">Unterricht</span>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">pio</span></b>
    </td>
</tr>

I can guarantee that there will be no ">" characters within the tags.

The solution I found was (?<=^|>)[^><]+?(?=<|$), but that doesn't work in node.js (probably because of the lookaheads? It says "Invalid group").

Any suggestions? (And yes, I really think that regex is the right way to go, because the HTML may be nested in other ways and the content always has the same order because it's a table.)

Bakudan
iStefo
  • I love linking to this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – NimChimpsky Sep 24 '11 at 17:05
  • Is this what you are looking for? http://stackoverflow.com/questions/822452/strip-html-from-text-javascript – Rusty Fausak Sep 24 '11 at 17:05
  • You cannot use regular expressions to parse HTML (this is the point of the link @NimChimpsky gave you), because HTML is not a regular language. Any attempt to use regular expressions, solely, to parse HTML ***will fail***. You have no choice but to actually *parse* the HTML. – T.J. Crowder Sep 24 '11 at 17:08
  • @rfausak: No, because the OP has said clearly they're not running in a browser. – T.J. Crowder Sep 24 '11 at 17:08
  • If you want to match something based on the context around and don't have lookarounds available, then ... no. – Howard Sep 24 '11 at 17:09
  • I think the second answer down on the question I linked has such a solution. – Rusty Fausak Sep 24 '11 at 17:09
  • @NimChimpsky well, I see your point, but parsing the HTML and looking for all leaves in the tree simply seems like overkill. – iStefo Sep 24 '11 at 17:37
  • @rfausak I'm afraid I cannot rely on browsers in this case, but your link contains a solution, thanks – iStefo Sep 24 '11 at 17:39
  • _"Any attempt to use regular expressions, solely, to parse HTML will fail."_ -- no, it won't. It will probably succeed, maybe even for over 90% of the cases, but that's the problem: **it _appears_ to be successful** so the developer doesn't realise they have bugs with certain input. // Of course, this is referring to uncontrolled/unknown input. If you know exactly what you have and are _matching text_ (not parsing HTML) then, in certain situations, it can be possible to craft a working regex (though that still doesn't mean it's necessarily easy or the best solution). – Peter Boughton Sep 24 '11 at 17:45
  • As for the question itself... iStefo, lookaheads work fine in JS - it's the lookbehind that is the problem. So you could match on `(?:^|>)[^<>]+(?=<|$)` and then for each item replace `^>` with empty string - of course, remembering that regex is _not_ guaranteed to be the best way to do this, and with unpredictable input you almost certainly will get incorrect matches at some point. – Peter Boughton Sep 24 '11 at 17:56
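
A minimal sketch of the lookbehind-free approach described in the last comment: match with a non-capturing `(?:^|>)` prefix instead of a lookbehind, then strip the leading `>` from each match. The helper name `textBetweenTags` and the sample string are made up for illustration, and trimming and dropping whitespace-only pieces is an extra assumption, not something the comment asks for.

```javascript
// Hypothetical helper: collects the text found between tags without using lookbehind.
function textBetweenTags(html) {
  // Each match is either text at the very start of the string,
  // or a ">" followed by text that runs up to the next "<" (or the end).
  const matches = html.match(/(?:^|>)[^<>]+(?=<|$)/g) || [];
  return matches
    .map((m) => m.replace(/^>/, '')) // drop the ">" that was matched instead of looked behind
    .map((s) => s.trim())            // assumption: surrounding whitespace is not wanted
    .filter((s) => s.length > 0);    // assumption: whitespace-only pieces are not wanted
}

const sample =
  '<td><b><span style="color: #010101">ELT.</span></b></td><td>&nbsp;</td>';
console.log(textBetweenTags(sample)); // [ 'ELT.', '&nbsp;' ]
```

As the comment says, this only holds up when the input is known and contains no stray `<` or `>` characters.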

2 Answers


Try 'yourhtml'.replace(/(<[^>]*>)/g,' ')

'<tr class="list even"><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">5</span></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">ELT.</span></b></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">SPR</span></b></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><strike><span style="color: #010101">pio</span></strike></td><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">Unterricht</span></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">pio</span></b></td></tr>'.replace(/(<[^>]*>)/g,' ')

This replaces every tag with a space, leaving space-delimited text that contains only the values you want to match; you can then split it on spaces.
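
If the values can themselves contain spaces, splitting on a space would break them apart. A sketch of the same idea with a different delimiter, assuming the control character U+0001 never occurs in the input (`DELIMITER`, `stripTags` and the shortened sample row are illustrative names and data, not part of the original code):

```javascript
// Assumption: U+0001 never appears in the HTML, so it is safe to use as a delimiter.
const DELIMITER = '\u0001';

// Hypothetical helper: replaces every tag with the delimiter, then splits on it.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, DELIMITER) // same tag pattern as above, minus the unused group
    .split(DELIMITER)
    .map((s) => s.trim())           // assumption: indentation around values is not wanted
    .filter((s) => s !== '');       // adjacent tags leave empty strings behind
}

const row =
  '<td class="list"><span style="color: #010101">5</span></td>' +
  '<td class="list"><b><span style="color: #010101">SPR</span></b></td>' +
  '<td class="list"><span style="color: #010101">Unterricht</span></td>';

console.log(stripTags(row)); // [ '5', 'SPR', 'Unterricht' ]
```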

Narendra Yadala
  • Yep, that's what I'll do, thanks. But I'll use some unusual UTF-8 character as the delimiter instead, because I think my values may contain whitespace as well... – iStefo Sep 24 '11 at 17:41

Maybe you can split directly using the tags themselves:

html.split(/<.*?>/)

Afterwards you have to remove the empty strings from the result.
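
Put together, this could look roughly like the sketch below. The helper name `splitOnTags` and the sample input are made up for illustration, and trimming each piece before filtering is an added assumption so that indentation between tags is discarded too:

```javascript
// Hypothetical helper: splits on the tags themselves and drops the empty pieces.
function splitOnTags(html) {
  return html
    .split(/<.*?>/)                       // "." does not match newlines, so a tag must sit on one line
    .map((part) => part.trim())           // assumption: whitespace around values is not wanted
    .filter((part) => part.length > 0);   // remove the empty strings left between adjacent tags
}

const html =
  '<tr>\n  <td><strike><span>pio</span></strike></td>\n  <td>&nbsp;</td>\n</tr>';
console.log(splitOnTags(html)); // [ 'pio', '&nbsp;' ]
```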

Howard