
I'm attempting to non-greedily parse out TD tags. I'm starting with something like this:

<TD>stuff<TD align="right">More stuff<TD align="right>Other stuff<TD>things<TD>more things

I'm using the below as my regex:

Regex.Split(tempS, @"\<TD[.\s]*?\>");

The records return as below:

""
"stuff<TD align="right">More stuff<TD align="right>Other stuff"
"things"
"more things"

Why is it not splitting that first full result (the one starting with "stuff")? How can I adjust the regex to split on all instances of the TD tag with or without parameters?

Bastien Vandamme
steventnorris
  • Please see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Brian Rasmussen Dec 12 '12 at 16:32
  • `.` just means a literal dot in a character class (`[.]`), not 'any character'. You _may_ have more success with `[^>]*`, _but_ it would break on a `>` in an attribute (which is one of the reasons why we often look at parsers rather than regexes to manipulate HTML & XML). – Wrikken Dec 12 '12 at 16:32
  • @Wrikken The HTML here is pretty static. There isn't much variation and I know the regex that would work for it. I didn't go the route of parsers because of that. Is there a way to make the . character mean 'any character' including whitespace? – steventnorris Dec 12 '12 at 16:37
  • I don't know the c# modifiers (in pcre it would be `/s`) to make the dot match all. However `[^>]*>` is functionally equivalent to `(.|\s)*?>`, and probably easier on the regex. – Wrikken Dec 12 '12 at 16:42
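
To make the comments above concrete, here is a minimal C# sketch (the class and method names are made up for illustration). It checks that `[.]` only matches a literal dot, reruns the pattern from the question on the sample input, and shows `RegexOptions.Singleline`, the .NET option that lets `.` match a newline as well (the counterpart of the `/s` modifier mentioned above).

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotInClassDemo
{
    // Sample input copied from the question, unbalanced quote included.
    public const string Sample =
        "<TD>stuff<TD align=\"right\">More stuff<TD align=\"right>Other stuff<TD>things<TD>more things";

    // The pattern from the question: [.\s] means "a literal dot or whitespace".
    public static string[] SplitOriginal(string input) =>
        Regex.Split(input, @"\<TD[.\s]*?\>");

    public static void Main()
    {
        // Inside a character class the dot loses its "any character" meaning.
        Console.WriteLine(Regex.IsMatch("a", "^[.]$")); // False
        Console.WriteLine(Regex.IsMatch(".", "^[.]$")); // True

        // Outside a class, '.' skips newlines unless Singleline is set.
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));                          // False
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b", RegexOptions.Singleline)); // True

        // Only the attribute-free <TD> tags are split on, giving four pieces.
        foreach (string part in SplitOriginal(Sample))
            Console.WriteLine($"\"{part}\"");
    }
}
```

The tags with attributes survive the split because `align="right"` contains characters that are neither dots nor whitespace, so `[.\s]*?` can never reach the closing `>`.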

3 Answers


For a non-greedy match, try `<TD.*?>`.

Jason
  • @Hambone Because `?` after the quantifier `*` tells the regex engine to stop consuming characters as soon as the part of the expression that follows `?` can match, which here is `>`. That is the difference between greedy and non-greedy `*`. – JustAMartin Apr 14 '16 at 16:05
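
As a quick check of this answer, the sketch below applies `<TD.*?>` to the sample string from the question. The class and method names are invented for the example, and it assumes the input sits on a single line, since `.` does not cross newlines by default.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazySplitDemo
{
    // Split on every TD tag, with or without attributes, using a lazy quantifier.
    public static string[] SplitLazy(string input) =>
        Regex.Split(input, @"<TD.*?>");

    public static void Main()
    {
        string tempS =
            "<TD>stuff<TD align=\"right\">More stuff<TD align=\"right>Other stuff<TD>things<TD>more things";

        // Six pieces: an empty string (the input starts with a tag), then
        // stuff, More stuff, Other stuff, things, more things.
        foreach (string part in SplitLazy(tempS))
            Console.WriteLine($"\"{part}\"");
    }
}
```

The leading empty string is expected: `Regex.Split` returns the text before the first match, and here the first match starts at position 0.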

From https://regex101.com/

  • * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  • *? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
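
The two definitions are easiest to see side by side. This is a small illustrative sketch, with a made-up helper name and a shortened input, that returns the first match of the greedy and the lazy variant:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern) =>
        Regex.Match(input, pattern).Value;

    public static void Main()
    {
        string input = "<TD>a<TD align=\"right\">b";

        // Greedy: .* runs to the end of the line, then gives back characters
        // until a '>' fits, so the match ends at the LAST '>'.
        Console.WriteLine(FirstMatch(input, @"<TD.*>"));  // <TD>a<TD align="right">

        // Lazy: .*? starts with nothing and expands only as needed,
        // so the match ends at the FIRST '>'.
        Console.WriteLine(FirstMatch(input, @"<TD.*?>")); // <TD>
    }
}
```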
Bastien Vandamme

The regex you want is <TD[^>]*>:

<     # Match opening tag
TD    # Followed by TD
[^>]* # Followed by anything not a > (zero or more)
>     # Closing tag

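Note: outside a character class, `.` already matches any character except a newline (spaces and tabs included). Inside a class, `[.]` matches only a literal `.`, so `[.\s]*?` matches nothing but dots and whitespace. If you want the dot version, use `.*?`.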
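
The commented layout above can be used almost verbatim in C# by passing `RegexOptions.IgnorePatternWhitespace`, which ignores unescaped whitespace in the pattern and treats `#` to end of line as a comment. The sketch below is only an illustration (class and method names are hypothetical, and the multi-line input is invented, not from the question). It also shows one practical difference from `<TD.*?>`: the negated class matches newlines, so a tag whose attributes wrap onto a second line is still split on.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Same pattern as above, written out with comments.
    private const string Pattern = @"
        <      # opening bracket
        TD     # tag name
        [^>]*  # anything that is not a closing bracket, zero or more times
        >      # closing bracket";

    public static string[] SplitNegated(string input) =>
        Regex.Split(input, Pattern, RegexOptions.IgnorePatternWhitespace);

    public static string[] SplitLazyDot(string input) =>
        Regex.Split(input, @"<TD.*?>");

    public static void Main()
    {
        // Hypothetical input where the first tag is wrapped across two lines.
        string input = "<TD\n  align=\"right\">More stuff<TD>things";

        Console.WriteLine(SplitNegated(input).Length); // 3: "", "More stuff", "things"
        Console.WriteLine(SplitLazyDot(input).Length); // 2: the wrapped tag is not matched
    }
}
```

As noted in the comments on the question, `[^>]*` would still stop early at a `>` inside an attribute value, so this remains a shortcut for predictable markup rather than a general HTML parser.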

Chris Seymour