1

I need to extract some HTML from a web page. The page contains comments, and I need the HTML that sits between them. I hope the example below helps. It needs to be done in C#.

<!--get html from here-->
<div><p>some text in a tag</p></div>
<!--get html from here-->

I want it to return

<div><p>some text in a tag</p></div>

How would I do this?

Brad Mace
gasman

4 Answers

2

What about finding the index of the first delimiter and the index of the second delimiter, then "cropping" the string in between? That sounds much simpler and might be just as effective.

Manrico Corazzi
2

Regexes are not ideal for HTML. If you really do want to process the HTML in all its glory, consider HtmlAgilityPack, as discussed in the question "Looking for C# HTML parser".

The Simplest Thing That Could Possibly Work is:

string pageBuffer = ...;
string wrapping = "<!--get html from here-->";
// The content starts just past the first marker.
int firstHitIndex = pageBuffer.IndexOf(wrapping) + wrapping.Length;
// Take everything up to, but not including, the second marker.
return pageBuffer.Substring(firstHitIndex, pageBuffer.IndexOf(wrapping, firstHitIndex) - firstHitIndex);

(with error checking that both markers are present)
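The error checking mentioned above could look roughly like the following. This is only a sketch: the class name `CommentExtractor` and the method name `TryExtractBetween` are made up for illustration, and it assumes the same marker text opens and closes the region, as in the question.

```csharp
using System;

public static class CommentExtractor
{
    // Hypothetical helper: copies the text between the first two occurrences
    // of marker into html. Returns false when either marker is missing.
    public static bool TryExtractBetween(string page, string marker, out string html)
    {
        html = string.Empty;

        int first = page.IndexOf(marker, StringComparison.Ordinal);
        if (first < 0)
        {
            return false; // opening marker not found
        }

        int start = first + marker.Length;
        int end = page.IndexOf(marker, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return false; // closing marker not found
        }

        html = page.Substring(start, end - start);
        return true;
    }
}
```

Called as `CommentExtractor.TryExtractBetween(page, "<!--get html from here-->", out string html)` on the sample from the question, `html` should hold the `<div>` line together with the line breaks that sat next to the markers, so a `Trim()` may be wanted.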

Depending on your context, WatiN might be useful (not on a server, but if you're on the client side and doing something more involved that could benefit from full HTML parsing).

Ruben Bartelink
2

If all the instances are similarly formatted, an expression like this

<!--[^(-->)]*-->(.*)<!--[^(-->)]*-->

would retrieve everything between two comments. If your "get html from here" text in your comments is well defined, you could be more specific:

<!--get html from here-->(.*)<!--get html from here-->

When you run the regex over the string, the match's Groups collection will contain the HTML between the comments.

Ben Von Handorf
  • That's wrong. `[^(-->)]` is a character class that matches any **one** character except one of `( ) - >`. You're probably thinking of a lookahead: `(?:(?!-->).)*` - zero or more of any character, unless the next three characters are `-->`. It's a very common mistake. – Alan Moore Nov 12 '09 at 14:12
  • You should probably also use the lazy quantifier *? for your captured expression since * is greedy and will happily eat a bunch of comments until it reaches the last one in the document. – Michael Petito Nov 12 '09 at 15:15
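Putting the two comments together, a corrected pattern might look like the sketch below. It swaps the character class for the negative lookahead Alan Moore describes and makes the capture lazy as Michael Petito suggests. The class and method names are hypothetical, and `RegexOptions.Singleline` is an added assumption so that `.` also matches line breaks, since the sample HTML spans several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentRegexExtractor
{
    // A comment is "<!--", then any run of characters that does not begin "-->",
    // then "-->". The lazy group (.*?) stops at the next comment it can find.
    private static readonly Regex BetweenComments = new Regex(
        @"<!--(?:(?!-->).)*-->(.*?)<!--(?:(?!-->).)*-->",
        RegexOptions.Singleline);

    // Hypothetical helper: returns the text between the first two comments,
    // or an empty string when the page does not contain two comments.
    public static string FirstBetweenComments(string page)
    {
        Match match = BetweenComments.Match(page);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Because the capture is lazy, a page with three or more comments yields only the text between the first two. The captured text keeps any surrounding line breaks.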
0

I ran into a similar requirement: stripping HTML comments. I was looking for a regular-expression-based solution that would work out of the box with free-form comments containing any kind of character.

I tried the pattern below, and it worked for single-line comments, multi-line comments, and comments containing Unicode characters and symbols.

<!--[\u0000-\u2C7F]*?-->
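
One way to apply that pattern is with `Regex.Replace`, as in the sketch below. The class and method names are hypothetical. Note that the range ends at U+2C7F, so a comment containing a character above that (most CJK text, for example) would be left in place.

```csharp
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // The pattern from this answer. The character range includes line breaks,
    // so multi-line comments match without RegexOptions.Singleline.
    private static readonly Regex Comment = new Regex(@"<!--[\u0000-\u2C7F]*?-->");

    // Hypothetical helper: removes every comment the pattern can match.
    public static string StripComments(string html)
    {
        return Comment.Replace(html, string.Empty);
    }
}
```

For instance, `CommentStripper.StripComments("a<!--x-->b")` should give back `"ab"`.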
Shoaib Nawaz