
I'm trying to figure out the regular expressions for the following and can't seem to get it right. Can someone advise me?

In a nutshell I have an htmlString which is:

        htmlString = "<HTML><HEAD></HEAD><BODY>Here are some images.</br>1) <IMG style='MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px' align=right src='images/sample001.jpg'>2) <IMG style='MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px' align=right src='images/sample002.png'></br> And some docs as well.</br>1) href='javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})'></br>2) href='javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})'></br></BODY></HTML>";

I run this through the following routine in C#, WPF:


    private static List<string> ExtractData(string htmlString)
    {
        List<string> data = new List<string>();

        //***  Get The Images ***
        string pattern = @"<img .* src='(.+\.(jpg|bmp|png))'";

        Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
        MatchCollection matches = rgx.Matches(htmlString);

        for (int i = 0, l = matches.Count; i < l; i++)
        {
            data.Add(matches[i].Value);
        }

        //***  Get Html Pages ***
        pattern = @"url:'([^']*)'";

        rgx = new Regex(pattern, RegexOptions.IgnoreCase);
        matches = rgx.Matches(htmlString);

        for (int i = 0, l = matches.Count; i < l; i++)
        {
            data.Add(matches[i].Value);
        }

        return data;
    }

and the result I get is:

[0] = "<IMG style='MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px' align=right src='images/sample001.jpg'>2) <IMG style='MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px' align=right src='images/sample002.png'"

[1] = "url:'testDoc001.htm'"

[2] = "url:'testDoc002.html'"

What I really want is:

[0] = "images/sample001.jpg"

[1] = "images/sample002.png"

[2] = "testDoc001.htm"

[3] = "testDoc002.html"

Can someone tell me what I'm doing wrong in my Regular Expression?

Thanks

anubhava
Ann Sanderson
  • See the first answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Cfreak Apr 16 '12 at 14:00
  • What you want is probably doable through regular expressions, but it won't be as neat and tidy as you'd expect. You should really use a parser to do this. http://stackoverflow.com/a/1732454/355724 – VeeArr Apr 16 '12 at 14:00
  • possible duplicate of [Regular Expression to get the SRC of images in C#](http://stackoverflow.com/questions/4257359/regular-expression-to-get-the-src-of-images-in-c-sharp) – H H Apr 16 '12 at 14:25

1 Answer

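You'd be better off using the HTML Agility Pack for this kind of work. As others have mentioned, parsing HTML with regex is, outside of very specific cases, a baaaad thing. Anyway, there are several problems with your regexes. The `.*` in the image pattern is greedy, so a single match runs from the first `<IMG` to the last `src=`, which is why both tags end up in one result. Also, `matches[i].Value` returns the whole match; the captured path is in `matches[i].Groups[1].Value`. The first pattern should look something like this: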

    <img.+?src\s*=\s*\'(.*?\.(jpg|bmp|png))'
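To show how that pattern fits into the routine from the question, here is a minimal sketch. It assumes the image pattern above and the question's own `url:'([^']*)'` pattern, and it reads capture group 1 instead of the whole match. The class name `LinkExtractor` is hypothetical and only there to make the example self-contained; treat this as one possible arrangement, not a definitive fix.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container class, used only to keep the sketch self-contained.
public static class LinkExtractor
{
    // Image pattern from the answer above: the lazy quantifiers (.+? and .*?)
    // stop at the first src attribute instead of running on to the last one.
    private static readonly Regex ImageSrc = new Regex(
        @"<img.+?src\s*=\s*\'(.*?\.(jpg|bmp|png))'",
        RegexOptions.IgnoreCase);

    // Page pattern taken unchanged from the question.
    private static readonly Regex PopupUrl = new Regex(
        @"url:'([^']*)'",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractData(string htmlString)
    {
        var data = new List<string>();

        // Groups[0] (same as Match.Value) is the entire match;
        // Groups[1] is the first parenthesised capture, i.e. the path itself.
        foreach (Match m in ImageSrc.Matches(htmlString))
        {
            data.Add(m.Groups[1].Value);
        }

        foreach (Match m in PopupUrl.Matches(htmlString))
        {
            data.Add(m.Groups[1].Value);
        }

        return data;
    }
}
```

Calling `LinkExtractor.ExtractData(htmlString)` with the string from the question should give the four bare paths in the order images first, then pages. Note that `.` does not match newlines by default, so markup that splits an `<IMG>` tag across lines would need `RegexOptions.Singleline` or, better, a real HTML parser.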
David Brabant