Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

I am trying to search the following HTML string to get the cost of these products:

<div id=menu>
  <p>A hamburger without cheese costs $5.</p>
  <p>A cheeseburger with one patty costs $6.</p>
</div>

I was able to successfully get the price of each item using the following expressions:

string hamburger = "<p>A hamburger[^\\$]+\\$(?<price>.*?).</p>";
string cheeseburger = "<p>A cheeseburger[^\\$]+\\$(?<price>.*?).</p>";

    public string GetProductPrice(string expression)
    {
        expression = Regex.Unescape(expression);
        Regex regex = new Regex(expression);
        MatchCollection mc = regex.Matches(MENU_DIV_STRING);

        if (mc.Count > 0 && mc[0].Groups.Count == 2)
            return mc[0].Groups[1].ToString();
        else
            return "--";
    }

However, I was thrown for a loop when given this:

<div id=menu>
  <p>A hamburger without cheese costs $5.</p>
  <p>A cheeseburger with one patty costs $6.</p>
  <p>A cheeseburger (SPECIAL: add an additional patty for $1 each) costs $6.</p>
</div>

The appearance of a second dollar sign in "add an additional patty for $1 each" threw me for a total loop. I've researched and tried a number of things, such as using different patterns, and at this point I've totally confused myself.

Is there a regular expression that will find out how much a cheeseburger costs whether there is a special or not?

Jeff H

1 Answer

No, no, no.

Regex is not a good choice for parsing HTML.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack.

Regex is for regular expressions, not irregular expressions.

You can retrieve the items with code like this:

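// Requires: using System; using System.Linq;
//           using System.Text.RegularExpressions; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// Collect the inner text of every <p> directly under <div id="menu">.
var itemList = doc.DocumentNode.SelectSingleNode("//div[@id='menu']")
                  .Elements("p")
                  .Select(p => p.InnerText)
                  .ToList();

foreach (var item in itemList)
{
    Match m = Regex.Match(item, @"(?<name>[Aa]?\s*.*?)\s.*?(?<price>\$\d+).*");
    if (m.Success)
    {
        // A bare "m.Groups[...].Value;" is not a valid C# statement,
        // so store the captured values before using them.
        string name = m.Groups["name"].Value;
        string price = m.Groups["price"].Value;
        Console.WriteLine(name + ": " + price);
    }
}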

The regex would be

(?<name>[Aa]?\s*.*?)\s.*?(?<price>\$\d+).*

Group 1 (name) captures the name.

Group 2 (price) captures the price.
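
Note that the pattern above stops at the first dollar amount it finds, so for the SPECIAL item it yields $1 rather than $6. Here is a minimal sketch of one way around that, assuming every item states its price as "costs $N" (as all three sample lines in the question do): anchor the match on the word "costs". MenuPrices and GetPrice are hypothetical names used only for illustration; they are not part of HtmlAgilityPack or of the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MenuPrices
{
    // Hypothetical helper: returns the amount that follows "costs $",
    // or "--" when the text contains no such phrase.
    public static string GetPrice(string itemText)
    {
        // Assumption: the real price is always introduced by the word "costs".
        // A "$1" inside a parenthesised special is not preceded by "costs",
        // so it is skipped. The cents part is optional.
        Match m = Regex.Match(itemText, @"costs\s+\$(?<price>\d+(?:\.\d{2})?)");
        return m.Success ? m.Groups["price"].Value : "--";
    }
}
```

You would call it on each p.InnerText from the loop above, for example MenuPrices.GetPrice(item). If the wording of the menu can vary, this anchor would need to change with it.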

Anirudha
  • OK using regex to parse html is not a good thing. But how would you parse `A cheeseburger (SPECIAL: add an additional patty for $1 each) costs $6.` with HtmlAgilityPack to get the price? – L.B Oct 19 '12 at 19:35
  • +1 "Irregular expression" ha ha ha! – Chuck Conway Oct 19 '12 at 19:53
  • OK, thanks. I did look at the HtmlAgilityPack site on CodePlex (and should have mentioned that) and didn't see any examples for parsing the inner text of HTML tags. This example helps. Thanks! – Jeff H Oct 19 '12 at 19:58
  • @Anirudha The only point is that your linq code wouldn't compile. – L.B Oct 19 '12 at 20:01
  • @Anirudha Why don't you try? Do you expect me to fix it? – L.B Oct 19 '12 at 20:02
  • Anirudha, I'm getting an error that I believe LB is referring to in your Where clause. Intellisense says that there is no definition for Attribute. When changing it to "Attributes", IS says cannot compare between HtmlAttribute and string. – Jeff H Oct 19 '12 at 20:15
  • @JeffH L.B helped it out..+100 to him – Anirudha Oct 19 '12 at 20:30
  • @L.B appreciate your good help – Anirudha Oct 19 '12 at 20:30
  • @JeffH added the regex code. Hope that helps. – Anirudha Oct 19 '12 at 20:46
  • @Anirudha, this is great, thank you. I need to refactor a bit of code but with your help I'm well on my way! – Jeff H Oct 19 '12 at 21:04