Regular Expression in .NET to parse HTML

Question

I'm trying to use a regular expression to match the following in a text file, where the orlst part of the string may be any sequence of letters (a-z, A-Z):

<frame src="orlst.html" name="list">

So far I've only been able to use a pattern of (<frame src=) to return any results. But it only returns <frame src= in the matches collection.

Any ideas how I could add to my pattern to return what I'm looking for?

HTML is not a regular language and therefore cannot be reliably parsed using regular expressions. You might be able to make it work in this specific instance, but it would be better to use an HTML parsing library. – evanmcdonnal, Mar 15 '14 at 00:54
and more importantly http://stackoverflow.com/a/1732454/585552 — Greg, Mar 15 '14 at 00:55

score 0 · Answer 1 · answered Mar 15 '14 at 01:02

Is this what you are looking for?

(<frame src="[a-zA-Z]*\.html" name="list">)

It matches your test string and any string where the 'orlst' part is a series of letters. As others have commented, though, you may be better off with an HTML parser.
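To return only the part before .html rather than the whole tag, a capture group can be added to the pattern. The following is a minimal sketch using System.Text.RegularExpressions; the class name FrameSrcExample, the method ExtractFrameName, and the group name "name" are made up for illustration, and it assumes the tag is written exactly as in the question (same attribute order, spacing, and quoting).

```csharp
using System;
using System.Text.RegularExpressions;

public static class FrameSrcExample
{
    // Hypothetical helper: returns the letters before ".html", or null if the tag is not found.
    public static string ExtractFrameName(string html)
    {
        // \. matches a literal dot; the named group "name" captures one or more letters.
        Match m = Regex.Match(html, "<frame src=\"(?<name>[a-zA-Z]+)\\.html\" name=\"list\">");
        return m.Success ? m.Groups["name"].Value : null;
    }

    public static void Main()
    {
        string line = "<frame src=\"orlst.html\" name=\"list\">";
        Console.WriteLine(ExtractFrameName(line)); // prints "orlst"
    }
}
```

Unlike the pattern above, this sketch uses + instead of *, so an empty name such as src=".html" is not treated as a match.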

Connor Pearson

score 0 · Answer 2 · answered Mar 15 '14 at 01:27

Try using the HTML Agility Pack. Here's an example that uses an XPath query to read the src attribute of an image:

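        // Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
        // 'link' is the URL of the page to load.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(link);
        doc.OptionUseIdAttribute = true;
        doc.OptionFixNestedTags = true;
        string img = string.Empty;
        if (doc.DocumentNode != null)
        {
            // SelectSingleNode returns null when nothing matches the XPath query,
            // so check for null instead of catching the resulting exception.
            HtmlNode imgNode = doc.DocumentNode.SelectSingleNode("//*[@class=\"thumb\"]//img[@src]");
            if (imgNode != null)
            {
                img = imgNode.Attributes["src"].Value;
            }
        }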
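Adapted to the frame tag from the question, the same approach might look like the sketch below. It assumes the HtmlAgilityPack package is referenced; the class name FrameSources, the method GetFrameSources, and the sample HTML string are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FrameSources
{
    // Hypothetical helper: collects the src attribute of every <frame> element in the markup.
    public static List<string> GetFrameSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection frames = doc.DocumentNode.SelectNodes("//frame[@src]");
        if (frames != null)
        {
            foreach (HtmlNode frame in frames)
            {
                sources.Add(frame.GetAttributeValue("src", string.Empty));
            }
        }
        return sources;
    }

    public static void Main()
    {
        // Sample markup modelled on the tag in the question.
        string html = "<frameset><frame src=\"orlst.html\" name=\"list\"></frameset>";
        foreach (string src in GetFrameSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

Each returned src value can then be resolved against the page URL and loaded with HtmlWeb.Load to parse the content shown inside the frame.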

fuzzybear

I originally tried HTML Agility pack but wasn't having much success with it. I was trying to see if I could get any information by looking at the childnodes. The HTML code that I'm trying to parse is from this site. (http://www.raws.dri.edu/wraws/ccaF.html) – user3421997 Mar 17 '14 at 15:51
Suggestion: the site is using frames. Have you tried the original URL www.raws.dri.edu/wraws/cca.html? There are lots of posts about using the pack, so you should definitely be able to pull whatever you need – fuzzybear Mar 18 '14 at 14:32
That's the direction I have taken. I'm opening the page that's in the frame and extracting the data using HtmlAgilityPack – user3421997 Mar 18 '14 at 16:33
Great to hear, will try and help if you get stuck, good luck – fuzzybear Mar 18 '14 at 18:20
