1

I have this string:

This is sample <p id="short"> the value of short </p> <p id="medium"> the value of medium </p> <p id="large"> the value of large</p>

which I want to break into 4 pieces:

  • string before p tags: This is sample
  • short : the value of short
  • medium: the value of medium
  • large: the value of large
KatieK
kobe
  • 1
    Why does everyone want to parse HTML with regex?! – hsz May 06 '11 at 23:06
  • Because the content comes from a third party. – kobe May 06 '11 at 23:08
  • 1
    Is the format of the <p> element consistent or do you need a pattern that is sensitive to other possible attributes? – csano May 06 '11 at 23:10
  • yeah it always comes with p tags only – kobe May 06 '11 at 23:12
  • Also, how about giving examples of a few approaches you've tried that didn't work? – csano May 06 '11 at 23:12
  • @kobe What about attributes? Just the id attribute? – csano May 06 '11 at 23:13
  • 2
    @kobe as mentioned in [an infamous post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), you should be using something like [HTML Agility Pack](http://www.codeplex.com/htmlagilitypack), not regex – brianpeiris May 06 '11 at 23:13
  • I would agree with brianpeiris. If not the HTML Agility Pack, then some basic string manipulation would do the trick as well, especially if the p tags always have the same structure. – Duncan Howe May 06 '11 at 23:19

6 Answers6

4

If you don't mind a non-regex solution (HTML is not a regular language), you can use this:

// Requires: using System.Linq; using System.Xml.Linq;
string input = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";

// Split off the text before the first tag, then wrap the rest in a root element so it parses as XML.
string before = input.Substring(0, input.IndexOf("<"));
string xmlWrapper = "<html>" + input.Substring(input.IndexOf("<")) + "</html>";
XElement xElement = XElement.Parse(xmlWrapper);

// Casting the attribute to string yields null (rather than throwing) when a <p> has no id.
var shortElement =
    xElement.Elements("p").SingleOrDefault(p => (string)p.Attribute("id") == "short");
var shortValue = shortElement != null ? shortElement.Value : string.Empty;

var mediumElement =
    xElement.Elements("p").SingleOrDefault(p => (string)p.Attribute("id") == "medium");
var mediumValue = mediumElement != null ? mediumElement.Value : string.Empty;

var largeElement =
    xElement.Elements("p").SingleOrDefault(p => (string)p.Attribute("id") == "large");
var largeValue = largeElement != null ? largeElement.Value : string.Empty;
Bala R
3

Here's my stab at it:

// Requires: using System.Text.RegularExpressions;
var regex = new Regex("(?<text>.*?)<p.*?>(?<small>.*?)</p>.*<p.*?>(?<medium>.*?)</p>.*<p.*?>(?<large>.*?)</p>.*");
var htmlsnip = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";

var match = regex.Match(htmlsnip);
var text = match.Groups["text"].Value;
var small = match.Groups["small"].Value;
var medium = match.Groups["medium"].Value;
var large = match.Groups["large"].Value;
Richard Nienaber
2
(?<string_before_p_tags>[^<]*)<p id="short">(?<short>.*)</p>\s*<p id="medium">(?<medium>.*)</p>\s*<p id="large">(?<large>.*)</p>

Returns the named capture groups:

string_before_p_tags: This is sample
short: the value of short
medium: the value of medium
large: the value of large
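
A minimal sketch of how this pattern might be applied from C#, assuming the input always has exactly the three p tags in this order. The class name `NamedGroupDemo` and the helper `Extract` are hypothetical names used only for illustration; the values are trimmed because the greedy groups also capture the spaces around each value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The pattern from this answer as verbatim strings (inner quotes are doubled).
    private const string Pattern =
        @"(?<string_before_p_tags>[^<]*)<p id=""short"">(?<short>.*)</p>\s*" +
        @"<p id=""medium"">(?<medium>.*)</p>\s*<p id=""large"">(?<large>.*)</p>";

    // Hypothetical helper: returns the trimmed value of one named group,
    // or null when the input does not match the pattern at all.
    public static string Extract(string input, string groupName)
    {
        Match match = Regex.Match(input, Pattern);
        return match.Success ? match.Groups[groupName].Value.Trim() : null;
    }

    public static void Main()
    {
        string input = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";

        // Print each named group on its own line.
        foreach (string name in new[] { "string_before_p_tags", "short", "medium", "large" })
        {
            Console.WriteLine(name + ": " + Extract(input, name));
        }
    }
}
```

Returning null on a failed match lets the caller tell "no match" apart from an empty paragraph.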

Town
1

Building on Bala R's answer, here's a more succinct way to do it with XPath:

// Requires: using System.Linq; using System.Xml.Linq; using System.Xml.XPath;
string input = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";
var xmlWrapper = "<html>" + input + "</html>";
var elements = XElement.Parse(xmlWrapper).XPathSelectElements("/*").ToList();

var text = elements[0].PreviousNode.ToString();
var small = elements[0].Value;
var medium = elements[1].Value;
var large = elements[2].Value;
Richard Nienaber
0

First of all, it has been said many times here that you should not use regex to parse HTML, for several reasons (mainly that HTML is not a regular language); you should use an HTML parser.

However, if some constraint prevents you from using an HTML parser, you can do the following:

1. string before p tags - ^[^<]*

2. short - <p id="short">([^<]*)</p>

3. medium - <p id="medium">([^<]*)</p>

4. large - <p id="large">([^<]*)</p>

For items 2 to 4, the value is in capture group 1.
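
As a rough sketch of running one pattern per piece in C#: this assumes each paragraph contains no nested tags and that the id attribute is written exactly as `id="..."`. The class `PerTagDemo` and the helpers `TextBeforeTags` and `ParagraphById` are hypothetical names for illustration, and the capture-group form `<p id="...">([^<]*)</p>` is one possible way to write the per-tag patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PerTagDemo
{
    // Hypothetical helper: everything from the start of the input up to the first '<'.
    public static string TextBeforeTags(string input)
    {
        return Regex.Match(input, @"^[^<]*").Value.Trim();
    }

    // Hypothetical helper: trimmed content of <p id="ID">...</p>,
    // or null when no such paragraph is found.
    public static string ParagraphById(string input, string id)
    {
        // Regex.Escape guards against ids that contain regex metacharacters.
        string pattern = "<p id=\"" + Regex.Escape(id) + "\">([^<]*)</p>";
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }

    public static void Main()
    {
        string input = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";

        Console.WriteLine(TextBeforeTags(input));
        foreach (string id in new[] { "short", "medium", "large" })
        {
            Console.WriteLine(id + ": " + ParagraphById(input, id));
        }
    }
}
```

Looking up each id separately means the paragraphs can appear in any order, unlike a single pattern that fixes the order.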

Cheers.

Michael
0

Using the HtmlAgilityPack it is very simple:

// Requires: using System; using System.Collections.Specialized; using HtmlAgilityPack;
string html = "This is sample <p id=\"short\"> the value of short </p> <p id=\"medium\"> the value of medium </p> <p id=\"large\"> the value of large</p>";
NameValueCollection output = new NameValueCollection();
string[] pIds = { "short", "medium", "large" };
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// Look up each paragraph by its id and store its inner HTML under that id.
foreach (string id in pIds)
{
    output.Add(id, doc.GetElementbyId(id).InnerHtml);
}

// Each key in output is a paragraph id; its value is that paragraph's content. Example:
Console.WriteLine(output["short"]);

Output: the value of short

The Mask