I have a requirement to parse the content out of Dreamweaver templates. I'm using C#.

Here is some example content that I will need to parse.

<div id="myDiv">
    <h1><!-- InstanceBeginEditable name="PageHeading" -->
    The Heading<!-- InstanceEndEditable --></h1>
    <!-- InstanceBeginEditable name="PageContent" -->
    <p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed nibh turpis, 
    sagittis vitae convallis at, fringilla nec augue.</p>
    <p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    Sed nibh turpis, sagittis vitae convallis at, fringilla nec augue.</p>
    <!-- InstanceEndEditable -->
</div><!-- END #myDiv-->

Dreamweaver templates are based around HTML comments containing specific strings that denote their purpose. The key ones for me are the following, as they mark the start and end of editable regions in the page.

<!-- InstanceBeginEditable name="xxxxxx" -->
<!-- InstanceEndEditable --> 

As you can see from my example HTML, there may be other comments in the source code.

So starting simple, I have the following, which matches all the opening Editable region tags.

<!-- InstanceBeginEditable(.*)?--> 

So next I want to get everything between there and the next "

<!-- InstanceBeginEditable(.*)?-->(?<content>(.*)?)<!-- InstanceEnd

Can you tell me why this is so? I would have thought a non-greedy capture (.*)? between my already working pattern and the literal

<!—InstanceEnd

would have matched what I need...

Greg B

2 Answers

1

You don't want to put parentheses around .*.

This means to grab everything greedily, or nothing at all, because the ? makes the whole group optional rather than making .* lazy:

(.*)?

This means to grab everything lazily:

.*?
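To see the difference in practice, here is a small sketch. The class name, the sample string and the shortened "End" token are made up for illustration; only the two quantifier forms come from the question and this answer.

```csharp
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Hypothetical input: two comment-delimited regions on a single line.
    public const string Sample =
        "<!-- A -->one<!-- End --><!-- B -->two<!-- End -->";

    // (.*)? is a greedy .* inside an optional group,
    // so the capture runs on to the LAST "<!-- End".
    public static string Greedy() =>
        Regex.Match(Sample, @"-->(.*)?<!-- End").Groups[1].Value;

    // .*? is lazy, so the capture stops at the FIRST "<!-- End".
    public static string Lazy() =>
        Regex.Match(Sample, @"-->(.*?)<!-- End").Groups[1].Value;
}
```

With this sample, Greedy() captures everything from "one" through "two", including the comments in between, while Lazy() captures just "one".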

Also, in your regex, you have only one - in the ending token. Change it to this:

<!-- InstanceBeginEditable.*?-->(?<content>.*?)<!-- InstanceEnd

By the way, it's dangerous to have two .*s in a regex without an atomic group. On unexpected data, you can get catastrophic backtracking. I'd recommend changing the first .*? to [^-]*. And, while I'm at it, I'll suggest you handle whitespace more forgivingly:

<!--\s*InstanceBeginEditable[^-]*-->(?<content>.*?)<!--\s*InstanceEnd

You probably already know this, but let me add that with .NET, you'll need to use RegexOptions.Singleline.
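Putting it together, a minimal sketch of how this might look in C#. The class and method names are hypothetical, and trimming the captured content is my own assumption about what you'd want; the pattern itself is the one above, unchanged.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EditableRegions
{
    // Pattern from the answer above. Singleline makes "." match newlines,
    // which is needed because region content usually spans several lines.
    private static readonly Regex Region = new Regex(
        @"<!--\s*InstanceBeginEditable[^-]*-->(?<content>.*?)<!--\s*InstanceEnd",
        RegexOptions.Singleline);

    // Returns the trimmed content of each editable region, in document order.
    public static List<string> Extract(string html)
    {
        var regions = new List<string>();
        foreach (Match m in Region.Matches(html))
        {
            regions.Add(m.Groups["content"].Value.Trim());
        }
        return regions;
    }
}
```

Calling EditableRegions.Extract on the HTML from the question should give two entries, one for the heading and one for the two paragraphs. Note that [^-]* stops at the first hyphen, so a region name containing a hyphen would not match this pattern.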

Jeremy Stein
  • Hi Jeremy, the single - in the end token was courtesy of Word, but thanks for noticing! – Greg B Oct 20 '09 at 18:45
  • Thanks for the info on greediness/laziness. I had thought of using \s for whitespace, but while I was trying to get it working I thought I'd keep it simple with a literal space. Cheers – Greg B Oct 20 '09 at 18:52
0

Use the HTML Agility Pack; see my answer to "How do I parse HTML using regular expressions in C#?"

nickytonline
  • Does HTML Agility Pack support sections surrounded by special comments, as in this question? I'm already trying to use it for this, but I can't see how to select anything other than normal nodes. – Daniel Revell Sep 28 '10 at 11:23
  • Readers - Also see [Does HtmlAgilityPack have the ability to use regular expressions in its XPATH selector?](https://stackoverflow.com/a/11729611/943435) – Yogi Apr 27 '19 at 18:57