I have to find all the strings surrounded by "[" and "]" using regex, but avoiding the ones inside the <table></table> block, for example:

<html>
<body>
<p><table>
   <tbody>
      <tr>
         <td style="border-style: solid; border-width:1px;">
            <span style="font-family: Courier;">[data1]</span>
         </td>
         <td style="border-style: solid; border-width:1px;">
            <span style="font-family: Courier;">[data10]</span>
         </td>
      </tr>
   </tbody>
</table>
</p>
<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>
</body>
</html>

In this case only [data3], [data4] and [data5] should be found. So far I have this: @"(((?<!<span>)(\[[a-zA-Z_0-9]+)](?!<\/span>))|((?<!<span>)(\[[a-zA-Z_0-9]+)])|((\[[a-zA-Z_0-9]+)](?!<\/span>)))(?!.*\1)" It finds all the [] blocks that are not surrounded by <span> tags. I tried adding a negative lookahead and lookbehind for <table>, but it doesn't work: it still gets the ones inside the table block.

Hope you guys can help me with this.

  • Obligatory link: [Do not use regex to parse HTML. TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/1954610) – Tom Lord Jul 14 '20 at 08:52
  • Which tool/language are you using? Using a regex to search the **inner-text only** (e.g. `"[data3]  [data4]...."`) is fine, but the first thing you need to do is parse the HTML - e.g. using an XPath. Trying to do the whole search via regex is technically impossible; at best, you'll have an extremely complicated solution that works for *most* inputs - rather than a simple XPath (or similar) that works for *all* inputs. – Tom Lord Jul 14 '20 at 08:56
  • @TomLord What kind of URL is that? I thought my graphics card was on the blink for a sec! – JGFMK Jul 14 '20 at 09:03
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. – Toto Jul 14 '20 at 09:05
  • @JGFMK It's called [Zalgo Text](https://lingojam.com/ZalgoText) – Tom Lord Jul 14 '20 at 09:05
  • @TomLord Thank you, I will do it this way. I wanted to avoid using an external parser, but it seems this is the only approach. – Valentín Corral García Jul 14 '20 at 10:38
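
A minimal sketch of the parser-first approach suggested in the comments above. The thread never states which language is in use, so Python and its standard-library `html.parser` module are an assumption made purely for illustration. The names `TokenCollector`, `find_tokens_outside_tables` and `SAMPLE` are hypothetical, and `SAMPLE` is a condensed copy of the HTML from the question. The token pattern reuses the `[a-zA-Z_0-9]+` character class from the question's regex.

```python
import re
from html.parser import HTMLParser

# Same character class as the pattern in the question.
TOKEN = re.compile(r"\[[a-zA-Z_0-9]+\]")


class TokenCollector(HTMLParser):
    """Collects [token] strings from text that is not inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while we are inside one or more tables
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        # The regex only ever sees inner text, never markup.
        if self.table_depth == 0:
            self.tokens.extend(TOKEN.findall(data))


def find_tokens_outside_tables(html):
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


SAMPLE = """
<html><body>
<p><table><tbody><tr>
  <td><span style="font-family: Courier;">[data1]</span></td>
  <td><span style="font-family: Courier;">[data10]</span></td>
</tr></tbody></table></p>
<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>
</body></html>
"""

print(find_tokens_outside_tables(SAMPLE))  # ['[data3]', '[data4]', '[data5]']
```

Note that `html.parser` is an event-based tokenizer, not a tree builder, so this sketch relies on every `<table>` having a matching `</table>`. A tree-building parser with XPath support would express the same filter more directly.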

1 Answer

The regex below will return all the [data] strings that are enclosed in a <p> </p> tag.

/<p.*?>\[(.*?)\]<\/p>/g

So the above regex will return <p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p> from your HTML code above.

Once you have that string, use the regex below to get only the [data] strings.

/\[(.*?)\]/g

So the above regex will return "[data3][data4][data5]" from that string.

Tom Lord
hami
  • This won't work for even the most trivial of variations. For example, if the input contains `<p>foo [data1] bar</p>` then the regex doesn't match. – Tom Lord Jul 14 '20 at 10:48
  • It's a good example of why the golden advice is *don't use regex to parse HTML* (if you want a proper, comprehensive solution). You could keep making that regex increasingly complex trying to cover more edge cases, and I could keep giving more and more examples that still make it fail... Or, you could stop trying to do the whole thing in regex, and just use an HTML parser. – Tom Lord Jul 14 '20 at 10:52
  • Your *second* pattern (`/\[(.*?)\]/g`) is perfectly fine for finding matches in the inner-text. But using regex to find that inner-text is a pathway to madness. (OP didn't even say that the text must be inside a `<p>` tag!! All they said is that it's *not* in a `<table>`.) – Tom Lord Jul 14 '20 at 10:54
  • Yes, for HTML it's better to use a parser instead of regex. – hami Jul 14 '20 at 12:43
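
The objections in these comments can be checked with a short sketch. Python is again an assumption, since the thread does not name a language. `P_WRAPPED` is a transcription of the answer's first pattern with the closing tag written as `</p>`, and `BRACKETED` is its second pattern. The `with_text` input is an assumed form of the example from the first comment, and `in_table` is a hypothetical input added here to show that the pattern has no notion of being inside a table.

```python
import re

# Transcriptions of the answer's two patterns (hypothetical names).
P_WRAPPED = re.compile(r"<p.*?>\[(.*?)\]</p>")
BRACKETED = re.compile(r"\[(.*?)\]")

original = "<p>[data3]&nbsp;&nbsp;[data4]&nbsp;&nbsp;[data5]</p>"
with_text = "<p>foo [data1] bar</p>"                            # text around the token
in_table = "<table><tr><td><p>[data1]</p></td></tr></table>"   # <p> inside a table

# The <p>...</p> line from the question: the pattern matches.
matches_original = P_WRAPPED.search(original) is not None      # True

# Plain text before the token: the pattern requires "[" right after "<p>".
matches_with_text = P_WRAPPED.search(with_text) is not None    # False

# A <p> inside a table is matched too, which the question wanted to avoid.
matches_in_table = P_WRAPPED.search(in_table) is not None      # True

# On inner text alone, the second pattern behaves as expected.
inner_tokens = BRACKETED.findall("foo [data1] bar")            # ['data1']
```

The second pattern is only reliable once something other than a regex has isolated the inner text, which is the point the comments make.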