I am trying to create a regex for the following string:

<tr>
        <td colspan=2>
        <p><b>
        CITY Head: 
        <span >
        <span >##CITY##</span>
        <o:p></o:p>
        </span>
        </b>
        </p>
        </td>
        <td colspan=1>

I want to find the whole TD block that has CITY Head in it. I could come up with the following regex.

<td(.*)[\s](.*)[\s]+CITY Head+(.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s](.*)[\s]+<\/td>

Basically I had to write (.*)[\s] for every line above and below CITY Head. But the number of lines can differ from case to case.

Therefore, I am looking for a general way to combine all the (.*)[\s] into something independent of the number of lines.

Invisible
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – HRgiger Jun 10 '16 at 07:31
  • But as bobince says: *"So go on, parse HTML with regex, if you must. It's only broken code, not life and death."* – Stephen C Jun 10 '16 at 08:14
  • Yes, following bobince's advice, one could eventually study tempered greedy tokens. Then, having issues with performance, go on to study unroll the loop technique. And in the end, just realize that a DOM parser was so much easier. – Wiktor Stribiżew Jun 10 '16 at 08:35

1 Answer

[\s\S]*? will match the smallest possible number (* = 0 or more, ? = lazy, also called ungreedy) of whitespace (\s) or non-whitespace (\S) characters, i.e. any characters at all, including newlines.
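As a rough illustration of why this replaces the repeated (.*)[\s] groups, here is a minimal Python sketch. The sample string is hypothetical and not taken from the question. Python's re.DOTALL flag would be another way to let . match newlines; the sketch sticks to [\s\S] because that is the form used in this answer.

```python
import re

# Hypothetical multi-line sample, not taken from the question.
text = "<td>\nline one\nline two\n</td><td>other</td>"

# '.' does not match a newline by default, so the multi-line cell is skipped
# and only the single-line cell is found.
dot_match = re.search(r"<td>.*?</td>", text)

# [\s\S] matches any character, newlines included, and the lazy *? stops
# at the first </td> it can reach.
any_match = re.search(r"<td>[\s\S]*?</td>", text)

print(repr(dot_match.group(0)))
print(repr(any_match.group(0)))
```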

<td((?!<\/?td)[\s\S])*?CITY Head[\s\S]*?<\/td>

The negative lookahead (?!<\/?td) is checked before every character that is consumed between <td and CITY Head, so that section can't span more than one table cell.
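To see what the assertion changes, here is a small Python sketch that runs the pattern with and without it. The input is a shortened, hypothetical variant of the markup in the question, with an extra cell placed in front. The slashes are left unescaped because Python does not need them escaped.

```python
import re

# Shortened, hypothetical variant of the question's markup with one extra
# cell placed before the cell we want.
html = (
    "<td>first</td>\n"
    "<td colspan=2>\n"
    "<p><b>\n"
    "CITY Head:\n"
    "<span>##CITY##</span>\n"
    "</b></p>\n"
    "</td>\n"
    "<td colspan=1>"
)

# Without the assertion, the match starts at the first <td in the input.
naive = re.compile(r"<td[\s\S]*?CITY Head[\s\S]*?</td>")

# With the assertion, the part before CITY Head may not run into
# another <td or </td.
tempered = re.compile(r"<td((?!</?td)[\s\S])*?CITY Head[\s\S]*?</td>")

naive_block = naive.search(html).group(0)
tempered_block = tempered.search(html).group(0)

print(naive_block)
print("----")
print(tempered_block)
```

The first result still begins with the unrelated cell, while the second begins at the cell that actually contains CITY Head.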

But using a regex isn't a reliable way of parsing HTML. In particular, this regex might pull out the wrong result if the HTML contains a syntax error.
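For comparison, a parser-based version might look like the following sketch, which uses Python's standard html.parser module. TdFinder is a hypothetical helper written for this illustration. It collects the text of each top-level cell rather than the raw markup, and it assumes that cells are not nested.

```python
from html.parser import HTMLParser


class TdFinder(HTMLParser):
    """Hypothetical helper: collects the text content of each <td> cell."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <td> elements are currently open
        self.current = []   # text pieces of the cell being read
        self.cells = []     # text of every cell that was closed

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            if self.depth == 1:
                self.current = []

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.cells.append("".join(self.current))

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


# Markup from the question, with the indentation dropped.
html = """<tr>
<td colspan=2>
<p><b>
CITY Head:
<span >
<span >##CITY##</span>
<o:p></o:p>
</span>
</b>
</p>
</td>
<td colspan=1>
"""

finder = TdFinder()
finder.feed(html)
finder.close()

# Keep only the cells whose text mentions CITY Head.
hits = [cell for cell in finder.cells if "CITY Head" in cell]
print(hits)
```

The unclosed second cell in the sample is simply never recorded, because a cell is only stored once its closing tag is seen.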

Matt Raines
  • The above regex would match all the TDs that come before "CITY Head". The regex has to be designed so that only the first TD before CITY Head is matched. – Invisible Jun 10 '16 at 09:20
  • True, should have tested. I've added a negative assertion to fix it. This is why I usually respond to these questions with "You can't parse HTML with regex" ;) – Matt Raines Jun 10 '16 at 09:29
  • Perfect. Thanks a lot. :-) – Invisible Jun 10 '16 at 09:37