Edit html document using regex replace and matching contents of only immediate child

Question

I have html that looks like so:

<ul style="list-style-type: square;">
<br />  
<li margin-left="80px">
    <br />first line        
    <br />
    <br />second line
</li>
<br />
<li margin-left="80px">
    <br />text line 1
</li>
<br />  
<li margin-left="80px">
    <br />text line 2
</li>
<br />
</ul>

I want to match the contents of the ul, but not the contents of the li elements. The end goal is to get rid of the <br /> tags that are directly under the <ul></ul> and not those under the <li></li>.

Note: for clarity I formatted the HTML above, but in my real-world scenario it arrives as a single giant string without any \r\n's.

For example:

[What is the nature of the Services?] [What are the overarching goals, objectives and outcomes you want to achieve?] [How should the Services be delivered?] <ul style="list-style-type: square;"> <li margin-left="80px"> gfhsdfsdf some line here</li> <li margin-left="80px"> sfdsfsdfsdf</li> <li margin-left="80px"> sdfsdfsdf</li> </ul> [Is the appointment of this Supplier exclusive?] [Refer to any proposal prepared by the Supplier if this helps describes any aspects of the Service]

My first idea was to:

use this to extract the contents of the <ul>: <ul[^>]*>(.*)</ul>

and then maybe run a second one to select all the li elements: <li[^>]*>.*</li>

and then somehow get rid of anything else that's left over.

But that's kind of lame, and besides,

<li[^>]*>.*</li>

matches a whole bunch of li's at once.

This entire string gets captured: <li margin-left="80px"> \t\tgfhsdfsdf \t\tsome line here</li> \t<li margin-left="80px"> \t\tsfdsfsdfsdf</li> \t<li margin-left="80px"> \t\tsdfsdfsdf</li>

I know it's because the dot is greedy, but I'm not sure how to avoid it. Something like [^</li>]* wouldn't work because it's treated as a set of characters, not as a string.

Any help is much appreciated.

So I have 2 problems:

1) I don't like the way I'm approaching this, so better ideas are needed. (I'm considering using the set operations of LINQ to XML to achieve this.) I still hope to do this with regex, so if anyone knows exactly how, please share.

2) How do I capture separate groups of li's instead of capturing everything from the first opening <li> to the last closing </li>?

Which language/platform are you using? Are you sure that the aforementioned language/platform doesn't provide means of doing such a thing? — FailedDev, Nov 11 '11 at 12:52

score 1 · Answer 1 · edited May 23 '17 at 11:55

I think you should go look at this... RegEx match open tags except XHTML self-contained tags

Then recognize that parsing HTML with a regex is not quite that easy. Personally, I would load the HTML into an HTML DOM object and then crawl the document. You might look at this project for some help:

http://htmlagilitypack.codeplex.com/
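
The "load it and crawl it" idea can be sketched without any regex. The snippet below is only an illustration in Python using the standard library's html.parser, not Html Agility Pack (which is a .NET library). The names DirectChildBrStripper and strip_direct_br are hypothetical, chosen for this example. It keeps a stack of open tags and drops a <br> only when its immediate parent is a <ul>.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DirectChildBrStripper(HTMLParser):
    """Re-emits the document, skipping <br> tags whose parent is <ul>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.stack = []  # currently open tags, innermost last
        self.out = []    # pieces of the rebuilt document

    def _is_unwanted_br(self, tag):
        return tag == "br" and bool(self.stack) and self.stack[-1] == "ul"

    def handle_starttag(self, tag, attrs):
        if self._is_unwanted_br(tag):
            return
        self.out.append(self.get_starttag_text())  # original tag text
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing syntax such as <br />.
        if self._is_unwanted_br(tag):
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def strip_direct_br(html):
    parser = DirectChildBrStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_direct_br('<ul><br /><li><br />a</li><br /></ul>'))
```

Because the decision is based on the actual parent element, this approach also copes with nested lists, which is where regex-based attempts tend to break down. Doctype declarations are not re-emitted by this sketch.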

answered Nov 11 '11 at 13:10

John Sobolewski

Yeah, you're right. I ended up loading it into an XDocument and using LINQ to navigate it. – ambidexterous Dec 01 '11 at 22:54

score 0 · Answer 2 · answered Nov 11 '11 at 14:43

Since you don't say which regex flavor you're using, here's a JavaScript-compatible regex to match a <br /> that's inside a <ul> element but not inside a <li> element:

<br\s*/>(?=[^<]*(?:<(?!/?ul\b)[^<]*)*</ul>)(?![^<]*(?:<(?!/?li\b)[^<]*)*</li>)

Breaking that down,

<br\s*/> matches the BR tag, of course.
(?=[^<]*(?:<(?!/?ul\b)[^<]*)*</ul>) looks ahead for the next occurrence of </ul>, but only if it doesn't encounter a <ul> tag first.
(?![^<]*(?:<(?!/?li\b)[^<]*)*</li>) does the same thing with </li> and <li> tags, but this time negating the result.
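
As a quick way to try the pattern, here is a minimal sketch in Python, whose re module accepts this syntax unchanged. The function name remove_br_under_ul and the sample string are made up for illustration. It assumes lists are not nested and that every <br /> uses the self-closing form, since the pattern requires the slash.

```python
import re

# The pattern from the answer, split over three lines for readability.
# Raw strings keep the backslashes intact.
BR_UNDER_UL = re.compile(
    r'<br\s*/>'                               # the BR tag itself
    r'(?=[^<]*(?:<(?!/?ul\b)[^<]*)*</ul>)'    # a </ul> follows before any <ul>
    r'(?![^<]*(?:<(?!/?li\b)[^<]*)*</li>)'    # but no </li> before the next <li>
)


def remove_br_under_ul(html):
    """Delete every <br /> that sits directly under a <ul>."""
    return BR_UNDER_UL.sub('', html)


sample = 'intro<br /><ul><br /><li><br />first</li><br /></ul>'
print(remove_br_under_ul(sample))
# -> intro<br /><ul><li><br />first</li></ul>
```

The <br /> before the list and the one inside the <li> are left alone; only the two that are direct children of the <ul> are removed.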

Being JS compatible, this should work in Dreamweaver as well as in editors with solid regex support, like EditPad and TextMate. It's also compatible with most Perl-derived flavors (Python, .NET, Java, etc.), though some syntactic tweaking will probably be needed.
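
On the question's second point (capturing each <li> separately instead of one big match), the usual fix for a greedy dot is a lazy quantifier, .*?, which stops at the first </li> it can. The sketch below uses Python and a shortened version of the string from the question; .NET regexes support the same .*? syntax. It assumes the <li> elements are not nested.

```python
import re

html = ('<ul> <li margin-left="80px"> gfhsdfsdf some line here</li> '
        '<li margin-left="80px"> sfdsfsdfsdf</li> '
        '<li margin-left="80px"> sdfsdfsdf</li> </ul>')

# Greedy: .* runs to the LAST </li>, so everything is one match.
greedy = re.findall(r'<li[^>]*>.*</li>', html)

# Lazy: .*? stops at the FIRST </li>, giving one match per item.
# DOTALL lets the dot cross line breaks if the input has any.
lazy = re.findall(r'<li[^>]*>(.*?)</li>', html, flags=re.DOTALL)

print(len(greedy))  # -> 1
print(lazy)
# -> [' gfhsdfsdf some line here', ' sfdsfsdfsdf', ' sdfsdfsdf']
```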
