I'm using Notepad++ to clean up a long and messy HTML table, and I'm trying to use regular expressions.

I need to remove all the table rows that don't contain a specific value (may I call that a substring?).

After having all the file contents unwrapped, I've been able to use the following regular expression to select, one by one, every table row with all its contents:

<tr>.+?</tr>

How can I improve the regular expression in order to select and replace only table rows containing, somewhere inside a <td> part of them, that defined substring?

I don't know if this matters, but the structure of every table row is the following (I've included every HTML tag; the dots stand for standard content/values):

<tr>
    <td> ... </td>
    <td> ... </td>
    <td> <a sfref="..." href="...">!! SUBSTRING I HAVE TO MATCH HERE !!</a> </td>
    <td> <img /> </td>
    <td> ... </td>
    <td> ... </td>
    <td> ... </td>
    <td> ... </td>
</tr>
Brian Tompsett - 汤莱恩
user1821136
  • Are you looking for a specific string (if so, can't you just include that in your regular expression)? Or are you looking for any content inside of an anchor tag? – mellamokb Nov 13 '12 at 16:02
  • It's 2012. Stop trying to parse HTML with regular expressions. Use an XML parser. –  Nov 13 '12 at 16:04
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  Nov 13 '12 at 16:04
  • This question is well-formulated. I don't think the downvotes are reasonable because the question itself is fine. – pimvdb Nov 13 '12 at 16:22

1 Answer

You should rather write a little script in a programming language that contains a simple DOM parser, because no regex solution can ever be perfect.

Also, your question seems a bit contradictory to me. First you say you want to remove all rows that don't contain a specific substring. Then you say you want to match all rows that do contain a specific substring.

Anyway, here is a makeshift regex solution for both cases. To ensure SUBSTRING occurs inside a row, you need to use this:

<tr>((?!</tr>).)+?SUBSTRING.+?</tr>

(?!...) is a negative lookahead. It might not be supported before Notepad++ 6, so make sure you update. The lookahead makes sure that we never go past the end of one table row just to find SUBSTRING in the next one. It does this by asserting, for every single character in our +? repetition, that it does not mark the beginning of </tr>.
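To see what the lookahead buys you, here is a small Python sketch that compares the pattern above with a naive version lacking the lookahead. The sample markup and the names NEEDLE, naive and tempered are made up for illustration, and the input is assumed to be unwrapped onto a single line, as in the question. Python's re module is used only as a stand-in for the Notepad++ search dialog.

```python
import re

NEEDLE = "SUBSTRING"  # hypothetical placeholder for the value to look for

# Unwrapped sample: only the second row contains the needle.
html = "<tr><td>one</td></tr><tr><td><a href='#'>SUBSTRING</a></td></tr>"

# Without the lookahead, the lazy .+? is free to run past the first </tr>.
naive = re.compile(r"<tr>.+?" + re.escape(NEEDLE) + r".+?</tr>")

# With the lookahead, no consumed character may start a closing </tr>.
tempered = re.compile(r"<tr>((?!</tr>).)+?" + re.escape(NEEDLE) + r".+?</tr>")

naive_match = naive.search(html).group(0)
tempered_match = tempered.search(html).group(0)

print(naive_match)     # spans both rows
print(tempered_match)  # only the row that contains the needle
```

The naive pattern starts at the first <tr> and swallows the unrelated first row, while the version with the lookahead fails there and only matches once it starts at the second row.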

To ensure that SUBSTRING does not occur inside the row, we can simply put SUBSTRING into that negative lookahead we already have:

<tr>((?!SUBSTRING).)+?</tr>
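As a rough illustration of the removal case, the following Python sketch applies this second pattern with an empty replacement, which is presumably what you would do in the Notepad++ replace dialog. The sample rows and variable names are invented, and the input is again assumed to be unwrapped onto one line.

```python
import re

NEEDLE = "SUBSTRING"  # hypothetical placeholder for the value to keep

# Three unwrapped rows; only the middle one contains the needle.
html = (
    "<tr><td>one</td></tr>"
    "<tr><td><a href='#'>SUBSTRING</a></td></tr>"
    "<tr><td>three</td></tr>"
)

# Matches a row only if no position inside it starts the needle.
rows_without = re.compile(r"<tr>((?!" + re.escape(NEEDLE) + r").)+?</tr>")

# Replacing every such row with nothing leaves just the rows to keep.
cleaned = rows_without.sub("", html)
print(cleaned)
```

Rows one and three are deleted; the row containing the needle cannot be matched, because the lookahead fails before the closing </tr> is reached.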

Note that both solutions will start to crumble if you have additional whitespace in your tags, attributes in the opening tags, and similar things, which is why a solution using a DOM parser is highly recommended.
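For comparison, a parser-based version might look like the sketch below. It uses Python's standard xml.etree.ElementTree, which is an XML parser, so it assumes the table is well-formed XHTML (as the sample row in the question happens to be). Real-world HTML would likely need a more tolerant HTML parser. The function name keep_rows_containing and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def keep_rows_containing(table_markup: str, needle: str) -> str:
    """Drop every <tr> whose text content does not contain needle.

    Assumes table_markup is well-formed XML with a single root element.
    """
    root = ET.fromstring(table_markup)
    # Snapshot the elements first, since rows are removed while looping.
    for parent in list(root.iter()):
        for row in list(parent.findall("tr")):
            row_text = "".join(row.itertext())
            if needle not in row_text:
                parent.remove(row)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<table>"
    "<tr class='odd'><td>a</td><td><a href='x'>KEEP me</a></td></tr>"
    "<tr ><td>b</td><td><img /></td></tr>"
    "</table>"
)
result = keep_rows_containing(sample, "KEEP")
print(result)
```

Unlike the regular expressions, this does not care about attributes or stray whitespace in the opening <tr> tags, because the parser has already turned the markup into elements.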

Martin Ender