C# Regex to grab 2 pieces of info from each HTML TR element - located inside different TD elements

Question

Given the following HTML...

<table>
    <tr>
        <td><strong>Name 1</strong></td>
        <td>Info and ignore <a href="/gohere"/>this</a></td>
        <td><a href="MySpecialAction?field=&list=10000">Edit</a></td>
    </tr>
    <tr>
        <td><strong>Name 2</strong></td>
        <td>Info and ignore <a href="/gohere"/>this</a></td>
        <td><a href="MySpecialAction?field=&list=10001">Edit</a></td>
    </tr>
</table>

Is it possible to write a single C# Regex that grabs the 'name' (found within td/strong) and the 'listid' (found in the href containing MySpecialAction)?

I've got it grabbing the name with the expression below (probably not efficient), but I was hoping I could write one expression that, given the HTML above, would produce 2 matches, each with two groups (named 'name' and 'id').

<strong\b[^>]*>(.*?)<\/strong>

Match1.name=Name 1  
Match1.id=10000  
Match2.name=Name 2  
Match2.id=10001

Thanks in advance.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. Don't use regex to parse html ! — mybirthname, Nov 25 '14 at 08:55
@spender please enlighten me on what an 'html parser' is? I wasn't looking to parse the entire HTML body, but rather to pluck a few strings from it. I was treating the HTML as simply a 'big string' and Regex as the tool to match part of that string. Obviously my knowledge in these areas is on the low side. Happy to use whatever the correct tool for the job is, ideally the easiest one (and, quite honestly, the easiest to read, since I've never taken the proper time to learn regex syntax). — Terry, Nov 25 '14 at 15:10

score 0 · Accepted Answer · answered Nov 25 '14 at 09:21

Parsing HTML with regex is, of course, fraught with peril and singularities. But IF you are doing a quick and dirty script and can assume your HTML structure is not weird and not nested, and IF you really want to cram two basically unrelated regexes into a single pattern to parse out your two tokens, and IF your hrefs are always surrounded by double quotes and not single quotes (etc.), you could try this:

/(?:<strong\b[^>]*>(?<name>.*?)<\/strong>|MySpecialAction\?.*?list=(?<id>[^&"]+))/

This works for your given input and captures the tokens into the named groups "name" and "id". (Each match fills only one of the two named groups, so test which one succeeded accordingly!)
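To make "test accordingly" concrete, here is a minimal C# sketch of how the pattern above might be used. It assumes the surrounding `/` characters are regex-literal delimiters from other languages and drops them, since in .NET they would be matched as literal slashes. The class name `RowScraper`, the method `Extract`, and the pairing logic (each name goes with the next id that follows it) are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowScraper
{
    // The answer's alternation without the "/" delimiters.
    // Each match fills either the "name" group or the "id" group, never both.
    private static readonly Regex Token = new Regex(
        @"<strong\b[^>]*>(?<name>.*?)</strong>|MySpecialAction\?.*?list=(?<id>[^&""]+)");

    // Pairs each name with the next id found after it (hypothetical helper).
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var rows = new List<KeyValuePair<string, string>>();
        string pendingName = null;

        foreach (Match m in Token.Matches(html))
        {
            if (m.Groups["name"].Success)
            {
                // Remember the name until its id shows up.
                pendingName = m.Groups["name"].Value;
            }
            else if (m.Groups["id"].Success && pendingName != null)
            {
                rows.Add(new KeyValuePair<string, string>(pendingName, m.Groups["id"].Value));
                pendingName = null;
            }
        }
        return rows;
    }

    public static void Main()
    {
        const string html = @"<table>
    <tr>
        <td><strong>Name 1</strong></td>
        <td>Info and ignore <a href=""/gohere""/>this</a></td>
        <td><a href=""MySpecialAction?field=&list=10000"">Edit</a></td>
    </tr>
    <tr>
        <td><strong>Name 2</strong></td>
        <td>Info and ignore <a href=""/gohere""/>this</a></td>
        <td><a href=""MySpecialAction?field=&list=10001"">Edit</a></td>
    </tr>
</table>";

        foreach (var row in Extract(html))
        {
            Console.WriteLine(row.Key + " -> " + row.Value);
        }
        // Prints:
        // Name 1 -> 10000
        // Name 2 -> 10001
    }
}
```

The pairing relies on every row containing exactly one name followed by one id. If a row lacks the Edit link, its name is silently replaced by the next row's name, which is one more reason to treat this as a quick and dirty approach.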

Scott Weaver


So, this seems to do almost exactly what I need... definitely sufficient for my needs. I clicked the link in the comments above about not using Regex but rather an 'HtmlParser'... I guess a) I'm confused why using Regex on HTML is immediately flagged as wrong, b) what are you calling an HTML parser? and c) other mechanisms (i.e. XElement and querying) also have problems when the HTML isn't proper XML, which is the case for the HTML I'm trying to grab a few things out of. Thanks for the answer; I'll see if anyone comments as to why I chose such a bad tool for this job. – Terry Nov 25 '14 at 09:37
(X)HTML is just too weird, complicated and nested for dependable parsing with regular expressions - you'd be surprised what can pass for "legal" (x)html and mess up your regex. And if that's the first time you read Bobince's epic rant, then your night has improved! :) – Scott Weaver Nov 25 '14 at 09:42
Improved if I read it, or improved if this is the first time I came across it? :) – Terry Nov 25 '14 at 15:04
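On the "what is an HTML parser" question raised in the comments: it is a library that builds a node tree from the markup, tolerating HTML that is not well-formed XML, so you can query it instead of pattern-matching a big string. Below is a minimal sketch, assuming the third-party HtmlAgilityPack NuGet package is installed. Nobody in the thread named a specific library, so this choice, the class `ParserSketch`, and the helper `ListParameter` are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package (assumption: installed)

public static class ParserSketch
{
    // Returns (name, list id) pairs, one per <tr> that contains both pieces.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            HtmlNode strong = row.SelectSingleNode(".//td/strong");
            HtmlNode link = row.SelectSingleNode(".//a[contains(@href, 'MySpecialAction')]");
            if (strong == null || link == null) continue; // skip incomplete rows

            string id = ListParameter(link.GetAttributeValue("href", ""));
            if (id != null)
            {
                result.Add(new KeyValuePair<string, string>(strong.InnerText.Trim(), id));
            }
        }
        return result;
    }

    // Hypothetical helper: pulls the value of "list" out of the query string.
    // Assumes parameters are separated by a bare '&', as in the sample input.
    private static string ListParameter(string href)
    {
        int q = href.IndexOf('?');
        if (q < 0) return null;

        foreach (string pair in href.Substring(q + 1).Split('&'))
        {
            if (pair.StartsWith("list=", StringComparison.Ordinal))
            {
                return pair.Substring("list=".Length);
            }
        }
        return null;
    }

    public static void Main()
    {
        const string html =
            "<table><tr><td><strong>Name 1</strong></td>" +
            "<td><a href=\"MySpecialAction?field=&list=10000\">Edit</a></td></tr></table>";

        foreach (var row in Extract(html))
        {
            Console.WriteLine(row.Key + " -> " + row.Value);
        }
    }
}
```

Because the name and the id are looked up inside the same row node, a row missing one of them is skipped rather than mispaired with a neighbouring row, which is the main practical difference from the regex approach.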
