0

I want to parse the second div from the following HTML:

<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>

i.e., this value: <div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>

The id can contain any numbers.

Here is what I am trying:

Regex rgx = new Regex(@"'post-body-\d*'");
var res = rgx.Replace("<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>", "");

I expect the result <div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg> but that is not what I am getting.

Chris Dargis
revolutionkpi

3 Answers

1

If you are 100% certain that the text before and after the number will always be the same, you could use the IndexOf and Substring methods of the String class to break the string into pieces.

string original = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

// IndexOf returns the position in the string where the piece we are looking for starts
int startIndex = original.IndexOf(@"<div class='post-body entry-content' id='post-body-");
// For the endIndex, add the number of characters in the string that you are looking for
int endIndex = original.IndexOf(@"' itemprop='articleBody'>") + 25;

// this substring will retrieve just the inner part that you are looking for
string newString = original.Substring(startIndex, endIndex - startIndex);

// newString should now equal "<div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>"


// or, if you want to just remove the inner part, build a different string like this:
// First, get everything leading up to the startIndex
string divString = original.Substring(0, startIndex);
// then, add everything after the endIndex
divString += original.Substring(endIndex);

// divString should now equal "<div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>"

hope this helps...
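
The same IndexOf/Substring idea can be wrapped in a small helper that also handles the case where the surrounding text is not found, since IndexOf returns -1 then and Substring would throw. This is only a sketch: the names TagSlicer and ExtractBetween are hypothetical, not part of any library, and it assumes the prefix and suffix are fixed, known strings.

```csharp
using System;

static class TagSlicer
{
    // Hypothetical helper: returns the first substring that starts with `prefix`
    // and ends with the next occurrence of `suffix`, or null if either is missing.
    public static string ExtractBetween(string text, string prefix, string suffix)
    {
        int start = text.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0) return null;

        // Look for the suffix only after the prefix, so an earlier occurrence is ignored.
        int suffixAt = text.IndexOf(suffix, start + prefix.Length, StringComparison.Ordinal);
        if (suffixAt < 0) return null;

        // Use suffix.Length instead of a hard-coded character count.
        int end = suffixAt + suffix.Length;
        return text.Substring(start, end - start);
    }
}
```

Calling TagSlicer.ExtractBetween(original, "<div class='post-body entry-content' id='post-body-", "' itemprop='articleBody'>") on the string above should return the inner div tag, and null if the page layout ever changes.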

1

The reason you don't get your expected result is that your regex string is only searching for 'post-body-\d*', but not the rest of the div tag. In addition, performing Regex.Replace actually replaces the text that you are searching for, rather than returning it, so you will end up getting everything but the text you are searching for.
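
To make the difference concrete, here is a minimal sketch that runs Regex.Replace with the pattern from the question and with a pattern covering the whole tag. The class and method names (ReplaceDemo, RemoveIdValue, RemoveWholeTag) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReplaceDemo
{
    public const string Html =
        "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

    // Pattern from the question: matches only the quoted id value,
    // so only that value is removed and the rest of the tag stays.
    public static string RemoveIdValue(string html) =>
        Regex.Replace(html, @"'post-body-\d*'", "");

    // Pattern covering the whole opening tag: the entire tag is removed.
    public static string RemoveWholeTag(string html) =>
        Regex.Replace(html, @"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>", "");
}
```

ReplaceDemo.RemoveIdValue(ReplaceDemo.Html) leaves "<div class='post-body entry-content' id= itemprop='articleBody'>" in the middle of the string, while ReplaceDemo.RemoveWholeTag(ReplaceDemo.Html) gives "<div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>", the result the question expected.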

Try changing your pattern to @"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>" and calling Regex.Matches (or Regex.Match if you only care about the first occurrence), then process the matches it returns.

For example:

string htmlText = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";

Regex rgx = new Regex(@"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>");
foreach (Match match in rgx.Matches(htmlText))
{
    // Process matches
    Console.WriteLine(match.ToString());
}
Jon Senchyna
0

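If your HTML fragment is well-formed XML, you could parse it with XElement and pull out the id attribute directly. The fragment in the question is not (its tags are never closed and kubedfiuabefiudsabiubfg has no value), so XElement.Parse would throw on it as written; the example below uses a closed version of it, e.g.

// requires: using System.Xml.Linq;
var html = "<div kubedfiuabefiudsabiubfg=''><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'></div></div>";
var data = XElement.Parse(html).Element("div").Attribute("id");
// data.Value is "post-body-7494158715135407463"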
James