
I have a large, malformed test HTML document that I need to extract some numbers from.

I'd like to get the primary ratio out. I'm using this regular expression:

(?<=Primary ratio</TD><TD>--</TD><TD>).*(?=</TD>)

On this string:

Primary ratio</TD><TD>--</TD><TD>10.52</TD><TD>14.97</TD><TD></TD></TR><TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD><TD>1.13</TD><TD>1.50</TD><TD></TD></TR><TR align='right'><TD align='left'>Secondary Ratio</TD><TD>--</TD><TD>0.70</TD><TD>0.70</TD><TD></TD></TR><TR align='right'><TD align='left'>RM Ratio</TD><TD>--</TD><TD>2.02</TD>

But I get this as a result:

10.52</TD><TD>14.97</TD><TD></TD></TR><TR align='right'><TD align='left'>Flip Ra
tio</TD><TD>-122.81</TD><TD>1.13</TD><TD>1.50</TD><TD></TD></TR><TR align='right
'><TD align='left'>Secondary Ratio</TD><TD>--</TD><TD>0.70</TD><TD>0.70</TD><TD>
</TD></TR><TR align='right'><TD align='left'>RM Ratio</TD><TD>--</TD><TD>2.02

I don't want that, I just want the 10.52 number in the first tag.

I mean, it found the start of the string perfectly, but it didn't find the first </TD>. What am I doing wrong?

Mike
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Rune FS Jul 25 '10 at 08:19

2 Answers

Use an HTML parser instead of a RegEx - the HTML Agility Pack is a good one.

In general, regular expressions are not suitable for usage with HTML, as HTML is not a regular language. This is particularly true if you are working with HTML from different sources. See here for a compelling demonstration.
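As a rough illustration of the parser approach: the HTML Agility Pack is a .NET library, so the sketch below swaps in Python's standard-library `html.parser` instead, which is also fairly forgiving of incomplete markup. The class `CellCollector`, the function `value_after_label`, and the opening `<TR>`/`<TD>` tags on the first row of the sample are assumptions made for the example (the question's string starts mid-row).

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> arrives as "td".
        if tag == "td":
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def value_after_label(html, label, offset=2):
    """Return the cell `offset` positions after the cell whose text is `label`.

    offset=2 skips the "--" cell that follows the label in the sample row.
    Returns None if the label or the target cell is missing.
    """
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    try:
        return parser.cells[parser.cells.index(label) + offset]
    except (ValueError, IndexError):
        return None


# Fragment modeled on the question's string; the leading tags are assumed.
sample = (
    "<TR align='right'><TD align='left'>Primary ratio</TD><TD>--</TD>"
    "<TD>10.52</TD><TD>14.97</TD><TD></TD></TR>"
    "<TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD>"
    "<TD>1.13</TD><TD>1.50</TD><TD></TD></TR>"
)

primary_ratio = value_after_label(sample, "Primary ratio")
print(primary_ratio)
```

Looking cells up by position relative to a label keeps the code independent of the exact tag spelling and attribute quoting, which is where regex-based extraction tends to break.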

Oded
  • It's a really nice malformed document. I don't know how the agility pack handles it. I'd just prefer to use regex in this case. I'll definitely keep this in mind in the future though. – Mike Jul 25 '10 at 07:34
  • @Mike - from the site: `The parser is very tolerant with "real world" malformed HTML.` – Oded Jul 25 '10 at 07:35
  • That, or an XML parser. I like XPath. Also, @Mike, read the first answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - because it's relevant and you'll enjoy it. – Lunivore Jul 25 '10 at 07:38
  • If it's malformed then Flynn1179's answer is probably what you're looking for. – Lunivore Jul 25 '10 at 07:39
  • @Lunivore - XML parsers are not suitable for valid HTML either - for example `<br>` is valid HTML (4.01), but not valid XML. Of course, XHTML is also XML, so that's a different issue. – Oded Jul 25 '10 at 07:42
  • Sure. Most modern HTML is XHTML anyway, because other people like XPath too. I think I used Regex the last time I needed to do something like this, but I acknowledge the complete unmaintainability of my code and hang my head in shame. Shame! – Lunivore Jul 25 '10 at 09:46
  • I also believe we are using it for different purposes. I've got a document I need to find specific information within, whereas the other question is asking about matching multiple tags. If I was parsing an HTML document to get everything inside every instance of a particular tag, for example, I would definitely use an HTML parser. I guess for different purposes, different tools can come into play. – Mike Jul 25 '10 at 12:48
  • @Mike - fair comment. Absolutely agree. – Oded Jul 25 '10 at 13:41
Replace .* with .*? near the end of your regex; that should stop it from matching too much. Normally .* will match as much as possible that fits the pattern; by adding the ?, you ask it to match as little as possible instead.
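To see the difference side by side, here is a minimal sketch in Python's `re` module. The question doesn't say which regex engine is in use, so Python is an assumption; the pattern is the one from the question, and the sample is a shortened copy of the question's string. The names `PREFIX`, `greedy`, and `lazy` are just for illustration.

```python
import re

# Shortened copy of the string from the question.
sample = (
    "Primary ratio</TD><TD>--</TD><TD>10.52</TD><TD>14.97</TD><TD></TD></TR>"
    "<TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD>"
)

# Lookbehind from the question: match must start right after this text.
PREFIX = r"(?<=Primary ratio</TD><TD>--</TD><TD>)"

# Greedy: .* grabs as much as it can, then backs off to the LAST </TD>.
greedy = re.search(PREFIX + r".*(?=</TD>)", sample).group(0)

# Lazy: .*? grabs as little as it can, stopping at the FIRST </TD>.
lazy = re.search(PREFIX + r".*?(?=</TD>)", sample).group(0)

print(greedy)
print(lazy)
```

The greedy match runs all the way to just before the final `</TD>` in the input, while the lazy one stops at the first.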

Flynn1179
  • This behaviour is known as "greedy" matching, by the way. The syntax Flynn proposes explicitly tells the regex parser to match non-greedy. – kander Jul 25 '10 at 07:39