Regular expression in C#

Question

I want to parse the second div from the following HTML:

<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>

i.e., this value: <div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>

The id can contain any numbers.

Here is what I am trying:

Regex rgx = new Regex(@"'post-body-\d*'");
var res = rgx.Replace("<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>", "");

I expect the result <div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg> but that is not what I am getting.

Try using a library to parse HTML; regular expressions with HTML are inadvisable. Try htmlagilitypack.codeplex.com/ — James, Aug 03 '12 at 13:56
@James - *very* inadvisable. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Bob Kaufman, Aug 03 '12 at 13:58
Please help yourself by writing a proper question. It seems like you are confused about what you need to do. — Anirudha, Aug 03 '12 at 14:06
Are you trying to **retrieve** the inner `<div>`, or **remove** it? Your question states that you want to "parse" it, but your sample code is actually trying to remove it and retrieve the remainder. — Jon Senchyna, Aug 03 '12 at 14:24

score 1 · Answer 1 · answered Aug 03 '12 at 14:11

If you are 100% certain that the text before and after the number will always be the same, you could use the .IndexOf and .Substring methods of the String class to break the string down into pieces.

string original = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// IndexOf returns the position in the string where the piece we are looking for starts
int startIndex = original.IndexOf(@"<div class='post-body entry-content' id='post-body-");
// For the endIndex, add the number of characters in the string that you are looking for
int endIndex = original.IndexOf(@"' itemprop='articleBody'>") + 25;

// this substring will retrieve just the inner part that you are looking for
string newString = original.Substring(startIndex, endIndex - startIndex);

// newString should now equal "<div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>"


// or, if you want to just remove the inner part, build a different string like this:
// First, get everything leading up to the startIndex
string divString = original.Substring(0, startIndex);
// then, add everything after the endIndex
divString += original.Substring(endIndex);

// divString should now equal "<div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>"
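A slightly more defensive variant of the same idea, as a sketch only: it takes the end offset from the marker's own Length instead of the hard-coded 25, and it checks for the -1 that IndexOf returns when a marker is missing. The names startMarker, endMarker, inner and remainder are made up for this illustration.

```csharp
using System;

string original = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// Hypothetical marker names; the literal text is copied from the sample above.
const string startMarker = "<div class='post-body entry-content' id='post-body-";
const string endMarker = "' itemprop='articleBody'>";

string inner = "";            // the inner <div ...> tag, if found
string remainder = original;  // the input with the inner tag removed

int startIndex = original.IndexOf(startMarker, StringComparison.Ordinal);
if (startIndex >= 0)
{
    // Look for the end marker only after the start marker.
    int endMarkerIndex = original.IndexOf(endMarker, startIndex + startMarker.Length, StringComparison.Ordinal);
    if (endMarkerIndex >= 0)
    {
        // Derive the end offset from the marker itself rather than a magic number.
        int endIndex = endMarkerIndex + endMarker.Length;
        inner = original.Substring(startIndex, endIndex - startIndex);
        remainder = original.Substring(0, startIndex) + original.Substring(endIndex);
    }
}

Console.WriteLine(inner);
Console.WriteLine(remainder);
```

If either marker is absent, inner stays empty and remainder is the unchanged input, rather than Substring throwing on a negative index.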

hope this helps...

score 1 · Accepted Answer · answered Aug 03 '12 at 14:15

The reason you don't get your expected result is that your regex string is only searching for 'post-body-\d*', but not the rest of the div tag. In addition, performing Regex.Replace actually replaces the text that you are searching for, rather than returning it, so you will end up getting everything but the text you are searching for.
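To make that concrete, here is a small sketch that runs the pattern from the question unchanged. Only the quoted id value is matched, so only that part is removed and the rest of the tag stays behind.

```csharp
using System;
using System.Text.RegularExpressions;

string htmlText = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// The pattern from the question: it covers just 'post-body-<digits>', quotes included.
Regex rgx = new Regex(@"'post-body-\d*'");

// Replace returns the input with every match swapped for the replacement text.
string res = rgx.Replace(htmlText, "");

Console.WriteLine(res);
// prints: <div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id= itemprop='articleBody'><div kubedfiuabefiudsabiubfg>
```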

Try replacing your regex string with @"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>", then use Regex.Matches (or Regex.Match if you only care about the first occurrence) and process the matches.

For example:

string htmlText = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

Regex rgx = new Regex(@"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>");
foreach (Match match in rgx.Matches(htmlText))
{
    // Process matches
    Console.WriteLine(match.ToString());
}
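If you want both readings of the question covered, the same full-tag pattern can retrieve the tag and remove it. The sketch below is one way to do that under a couple of assumptions that are not in the answer above: it uses \d+ rather than \d* so that an empty id does not match, and it adds a named group called id (a name chosen here for illustration) to pull out the number on its own.

```csharp
using System;
using System.Text.RegularExpressions;

string htmlText = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// Full-tag pattern; (?<id>\d+) captures the numeric part of the id attribute.
Regex rgx = new Regex(@"<div class='post-body entry-content' id='post-body-(?<id>\d+)' itemprop='articleBody'>");

// Retrieve: Match returns the first occurrence (check Success before using it).
Match match = rgx.Match(htmlText);
string tag = match.Success ? match.Value : "";
string id = match.Success ? match.Groups["id"].Value : "";

// Remove: Replace returns the input with the whole matched tag cut out.
string remainder = rgx.Replace(htmlText, "");

Console.WriteLine(tag);
Console.WriteLine(id);
Console.WriteLine(remainder);
```

This still depends on the attributes appearing in exactly this order with single quotes, which is why the comments on the question recommend an HTML parser for anything less predictable.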

score 0 · Answer 3 · James · edited Aug 03 '12 at 14:31

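If your HTML fragment is well-formed XML (every tag closed, every attribute given a quoted value), you could parse it as an XML fragment and pull out the id attribute directly. Note that the sample string as posted is not well-formed, so XElement.Parse would throw an XmlException on it until it is cleaned up. For example:

// requires: using System.Xml.Linq;
var html = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
var data = XElement.Parse(html).Element("div").Attribute("id");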
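A minimal sketch of this approach, assuming the fragment has first been turned into well-formed XML. The xml string below is a hand-edited, hypothetical version of the sample (valueless attributes dropped, tags closed). The unmodified sample is included only to show that XElement.Parse rejects it.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// The sample exactly as posted: attributes without values and unclosed tags.
string posted = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

bool rejected = false;
try
{
    XElement.Parse(posted);
}
catch (XmlException)
{
    // Not well-formed XML, so the parser refuses it.
    rejected = true;
}

// Assumption: a cleaned-up, well-formed equivalent of the sample.
string xml = "<div><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'></div></div>";

// The root element is the outer div; Element("div") returns its first child div.
XElement inner = XElement.Parse(xml).Element("div");

// Casting an XAttribute to string yields its value, or null if the attribute is missing.
string id = (string)inner?.Attribute("id");

Console.WriteLine(rejected);
Console.WriteLine(id);
```

For real-world HTML that cannot be cleaned up by hand, a tolerant HTML parser such as the one suggested in the comments on the question is the more realistic route.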

edited Aug 03 '12 at 14:31

answered Aug 03 '12 at 14:03

James

