How to avoid html blocks with regex

Question

I have to find all the strings surrounded by "[" and "]" using regex, but avoiding the ones inside the <table></table> block, for example:

<html>
<body>
<p><table>
   <tbody>
      <tr>
         <td style="border-style: solid; border-width:1px;">
            <span style="font-family: Courier;">[data1]</span>
         </td>
         <td style="border-style: solid; border-width:1px;">
            <span style="font-family: Courier;">[data10]</span>
         </td>
      </tr>
   </tbody>
</table>
</p>
<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>
</body>
</html>

In this case only [data3], [data4] and [data5] should be found. So far I have this: @"(((?<!<span>)(\[[a-zA-Z_0-9]+)](?!<\/span>))|((?<!<span>)(\[[a-zA-Z_0-9]+)])|((\[[a-zA-Z_0-9]+)](?!<\/span>)))(?!.*\1)" That finds all the [] blocks that are not surrounded by tags. I tried adding a negative lookahead and lookbehind for <table>, but it doesn't work: it still gets the ones inside the table block.

Hope you guys can help me with this.

Obligatory link: [Do not use regex to parse HTML. TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/1954610) — Tom Lord, Jul 14 '20 at 08:52
Which tool/language are you using? Using a regex to search the **inner-text only** (e.g. `"[data3] [data4]...."`) is fine, but the first thing you need to do is parse the HTML, e.g. with an XPath query. Trying to do the whole search via regex is technically impossible; at best, you'll have an extremely complicated solution that works for *most* inputs, rather than a simple XPath (or similar) that works for *all* inputs. — Tom Lord, Jul 14 '20 at 08:56
@TomLord What kind of URL is that? I thought my graphics card was on the blink for a sec! — JGFMK, Jul 14 '20 at 09:03
[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. — Toto, Jul 14 '20 at 09:05
@JGFMK It's called [Zalgo Text](https://lingojam.com/ZalgoText) — Tom Lord, Jul 14 '20 at 09:05
@TomLord Thank you, I will do it this way. I wanted to avoid using an external parser, but it seems this is the only approach. — Valentín Corral García, Jul 14 '20 at 10:38
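
A minimal sketch of the approach suggested in the comments above: parse the HTML first, then run a simple regex on the text only. The question does not name a language, so Python and its standard-library `html.parser` are assumptions here, and the names `TokensOutsideTables` and `find_tokens` are hypothetical. It tracks how deeply the parser is nested inside `<table>` elements and only searches text seen at depth zero.

```python
import re
from html.parser import HTMLParser

# Same token shape as the question's pattern: "[" + word characters + "]".
TOKEN = re.compile(r"\[[a-zA-Z_0-9]+\]")


class TokensOutsideTables(HTMLParser):
    """Collects [token] strings from text that is not inside any <table>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.table_depth = 0  # > 0 while inside one or more <table> elements
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # Only text outside every table is searched.
        if self.table_depth == 0:
            self.tokens.extend(TOKEN.findall(data))


def find_tokens(html):
    parser = TokensOutsideTables()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Shortened version of the HTML from the question.
sample = """<html><body>
<p><table><tbody><tr>
<td><span style="font-family: Courier;">[data1]</span></td>
<td><span style="font-family: Courier;">[data10]</span></td>
</tr></tbody></table></p>
<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>
</body></html>"""

print(find_tokens(sample))  # ['[data3]', '[data4]', '[data5]']
```

Because the table check is done by the parser rather than by the pattern, the regex stays as simple as the token itself, and the text does not need to sit inside any particular tag.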

score -1 · Answer 1 · edited Jul 14 '20 at 10:46

The regex below matches a <p> </p> element whose content starts and ends with a [data] item.

/<p.*?>\[(.*?)\]<\/p>/g

Applied to the HTML in your question, it returns <p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>.

Once you have that string, run the regex below on it to extract only the [data] items.

/\[(.*?)\]/g

This returns [data3], [data4] and [data5] from that string.

answered Jul 14 '20 at 09:19 by hami · edited Jul 14 '20 at 10:46 by Tom Lord

This won't work for even the most trivial of variations. For example, if the input contains `<p>foo [data1] bar</p>` then the regex doesn't match. – Tom Lord Jul 14 '20 at 10:48
It's a good example of why the golden advice is *don't use regex to parse HTML* (if you want a proper, comprehensive solution). You could keep making that regex increasingly complex trying to cover more edge cases, and I could keep giving more and more examples that still make it fail... Or, you could stop trying to do the whole thing in regex, and just use an HTML parser. – Tom Lord Jul 14 '20 at 10:52
Your *second* pattern (`/\[(.*?)\]/g`) is perfectly fine for finding matches in the inner-text. But using regex to find that inner-text is a pathway to madness. (OP didn't even say that the text must be inside a `<p>` tag!! All they said is that it's *not* in a `<table>`.) – Tom Lord Jul 14 '20 at 10:54
Yes, for HTML it's better to use a parser instead of regex. – hami Jul 14 '20 at 12:43
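
The failure described in these comments can be reproduced with a short sketch. It assumes Python's `re` module (the answer uses JavaScript-style literals), with the closing tag matched as a literal `</p>`; the names `P_BLOCK`, `ITEM` and `two_step` are hypothetical.

```python
import re

# Step 1 of the answer: a <p> element whose content starts with "[" and ends with "]".
P_BLOCK = re.compile(r"<p.*?>\[(.*?)\]</p>")
# Step 2 of the answer: every [...] item inside the matched string.
ITEM = re.compile(r"\[(.*?)\]")


def two_step(html):
    """Apply the answer's two patterns in sequence and return the bracketed items."""
    items = []
    for block in P_BLOCK.finditer(html):
        items.extend(m.group(0) for m in ITEM.finditer(block.group(0)))
    return items


# Works on the paragraph from the question...
print(two_step("<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>"))
# ['[data3]', '[data4]', '[data5]']

# ...but finds nothing once other text surrounds the item.
print(two_step("<p>foo [data1] bar</p>"))
# []
```

The second pattern alone would find `[data1]` in the second input; it is the first pattern, which tries to locate the right HTML region, that breaks.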
