Regex read html values

Question

I have this file that contains the following text (html):

<tr> 
<th scope="row">X:</th> 
<td>343</td> 
</tr> 
<tr> 
<th scope="row">Y:</th> 
<td>6,995 sq ft / 0.16 acres</td> 
</tr>

And I have this method to read the values for X and Y:

    private static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>();
        var keys = string.Join("|", keywords.ToArray());
        var matches = Regex.Matches(source, @"\b(?<key>" + keys + @"):\s*(?<value>)");

        foreach (Match m in matches)
        {
            try
            {
                var key = m.Groups["key"].ToString();
                var value = m.Groups["value"].ToString();
                found.Add(key, value);
            }
            catch
            {
            }
        }
        return found;
    }

I can't get the method to return the values for X and Y.

Is anything wrong with the regular expression?

Ordinarily, this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — josh.trow, Apr 25 '11 at 17:24

score 1 · Answer 1 · answered Apr 25 '11 at 17:30


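You have </th> and <td> (plus whitespace) between the keyword and the value, so you need to skip them in your regex like this:

@"\b(?<key>" + keys + @"):\s*</th>[^<]*<td>(?<value>[^<]*)"

And BTW, you need to specify a pattern for "value": as written, (?<value>) is an empty group, so it always captures an empty string. I've specified it as [^<]*.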
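
For reference, here is a minimal sketch of how the question's FindKeys method might look with that pattern dropped in. The class name KeyFinder, the Regex.Escape call, the Trim on the value, and the indexer assignment (in place of the empty try/catch around Add) are my own additions and not part of the original method, so treat them as assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyFinder // hypothetical wrapper class, for illustration only
{
    public static Dictionary<string, string> FindKeys(IEnumerable<string> keywords, string source)
    {
        var found = new Dictionary<string, string>();

        // Assumption: escape each keyword so regex metacharacters in a key match literally.
        var keys = string.Join("|", keywords.Select(k => Regex.Escape(k)));

        // Skip the closing </th>, any whitespace, and the opening <td>,
        // then capture everything up to the next '<' as the value.
        var pattern = @"\b(?<key>" + keys + @"):\s*</th>[^<]*<td>(?<value>[^<]*)";

        foreach (Match m in Regex.Matches(source, pattern))
        {
            // The indexer overwrites on a repeated key instead of throwing.
            found[m.Groups["key"].Value] = m.Groups["value"].Value.Trim();
        }
        return found;
    }
}
```

Called with the keywords "X" and "Y" and the HTML from the question, this should yield one entry per row. It still depends on the exact th/td layout shown there, so any change in the markup (attributes on td, nested tags in the value) would need a different pattern.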

answered Apr 25 '11 at 17:30

Alex Netkachov


score 0 · Answer 2 · answered Apr 25 '11 at 19:09

As I'm sure you know, parsing HTML with a regex is never fun. Your current regex is not very close to capturing what you're looking for. As such, I would recommend a few possible alternatives...

Option 1 - If adding a library is acceptable, use the Html Agility Pack. It's blazing fast and very accurate.

Option 2 - If you're looking for a lighter-weight solution, these source files contain a regex parser for xml/html. To use it directly, implement IXmlLightReader and then call the XmlLightParser.Parse method. If your document is a complete HTML document and not a fragment, you can also use the HtmlLightDocument as follows:

HtmlLightDocument doc = new HtmlLightDocument(@"<html> ... </html>");
foreach(XmlLightElement row in doc.Select(@"//tr"))
    found.Add(
        row.SelectSingleNode(@"th").InnerText, 
        row.SelectSingleNode(@"td").InnerText
    );

Option 3 - As always, if the HTML is XHTML compliant then you can just use an XML parser.
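
As a rough illustration of Option 3, here is a sketch using LINQ to XML (System.Xml.Linq) from the .NET base class library. It assumes the rows are well-formed XML with no namespace, as in the question's snippet, and wraps them in a made-up <table> root because XElement.Parse needs a single root element. The class and method names (XhtmlRowReader, ReadRows) are hypothetical:

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class XhtmlRowReader // hypothetical name, for illustration only
{
    public static Dictionary<string, string> ReadRows(string fragment)
    {
        // Assumption: the fragment is well-formed XML; wrap it in one root so it parses.
        var root = XElement.Parse("<table>" + fragment + "</table>");
        var found = new Dictionary<string, string>();

        foreach (var row in root.Elements("tr"))
        {
            var th = row.Element("th");
            var td = row.Element("td");
            if (th == null || td == null)
                continue; // skip rows that lack a header or a value cell

            // Drop the trailing ':' from the header text ("X:" becomes "X").
            found[th.Value.Trim().TrimEnd(':')] = td.Value.Trim();
        }
        return found;
    }
}
```

XElement.Parse throws an XmlException on markup that is not well-formed (unclosed tags, unquoted attributes, undefined entities such as &nbsp;), which is why this only suits XHTML-compliant input.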
