Regex select XML Element (containing hyphen) and inside content

Question

I'm working with an enterprise CMS. To build our weekly-updated dropdown menu without republishing the entire site, I have the CMS generate an XML document that contains a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the linked page's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands, I think it's over 600KB). The element is <page-content>filler here</page-content>.

What I'm trying to do is use TextWrangler to find and replace all <page-content> elements, meaning the tags along with the content they contain.

I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.

Here's what I've tried:

(<page-content>)(.*?)

The above will match up until the next starting <page-content> tag, which is not what I want.

(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)

The two patterns above find no matches, even though the one below finds the 7 matches it should.

(<content>)(.*?)(<\/content>)

I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.

Thanks!

EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to delete manually and then re-save the file every week.

Do not parse HTML (XML) with regex http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — kirilloid, Aug 21 '13 at 15:17
@kirilloid Like I said, I know it should not be done, but it has to be done, otherwise I'll be wasting an hour or so of my time every week cleaning up this one file. If I should not be parsing it with regex, how else can I use a text editor to find and replace all of this? — Josh Allen, Aug 21 '13 at 15:20
Your regex looks fine to me (and matches when I run a simple test). Escaping the hyphen is unnecessary in this context. — Smern, Aug 21 '13 at 15:22
@smerny so this seems to be a limitation of regular expressions. Do you have any idea how I can find and replace all this information? — Josh Allen, Aug 21 '13 at 15:24
Limitation how? It seems to work [here](http://www.debuggex.com/r/pCll8gg8c6WBsY2p/0) — Smern, Aug 21 '13 at 15:27
@JoshAllen nothing wrong with the regex and the data you provided. What your editor wants on the other hand we don't really know. If the content spans multiple lines you may need the `/s` flag tho. Example: http://regex101.com/r/iU1mE1 — Qtax, Aug 21 '13 at 15:27
Aha. That must be it. How would I apply the /s flag in this situation? — Josh Allen, Aug 21 '13 at 15:30
With TextWrangler it looks like you add `(?s)` to the beginning of your regex. — Smern, Aug 21 '13 at 15:33
@smerny, Thank you so much! `(<page-content>)(?s)(.*?)(?s)(<\/page-content>)` ended up being exactly what I needed. If you want to submit an answer I'll mark it as best answer. — Josh Allen, Aug 21 '13 at 15:36
@JoshAllen, the second `(?s)` is unnecessary. `(?s)` just turns on that modifier for anything after it. — Smern, Aug 21 '13 at 15:48

Smern · Accepted Answer · 2013-08-21T15:50:30.090


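It seems the problem is that your . is not matching the newlines that sit between your opening and closing tags. By default, . matches any character except a newline, so a pattern that works on a single-line element fails as soon as the content spans several lines.

An easy solution is to add the s flag so that . also matches newlines. TextWrangler appears to support inline modifiers such as (?s). You could do it like this: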

(<page-content>)(?s)(.*?)(<\/page-content>)

More information on modifiers here.
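
If the weekly cleanup ever moves out of the editor and into a script, the same idea can be sketched in Python. This is only an illustration under assumptions: the sample document and the function name `strip_page_content` are made up here, and Python expects an inline (?s) at the very start of a pattern, so the flag is passed as `re.DOTALL` instead. The capture groups are dropped because nothing is reused in the replacement.

```python
import re

# Hypothetical sample standing in for the CMS output; the real file will differ.
SAMPLE = """<item>
<title>Home</title>
<page-content>
<html>
<body>filler here</body>
</html>
</page-content>
</item>"""

# Same pattern as in the answer, without the groups or the escaped slash.
PATTERN = r"<page-content>.*?</page-content>"


def strip_page_content(xml_text: str) -> str:
    """Remove every <page-content> element, including multi-line content."""
    # re.DOTALL is Python's name for the s flag: it lets . match newlines.
    return re.sub(PATTERN, "", xml_text, flags=re.DOTALL)


# Without the flag, . stops at the first newline, so nothing matches.
without_flag = re.findall(PATTERN, SAMPLE)

# With the flag, the lazy .*? runs up to the first closing tag.
with_flag = re.findall(PATTERN, SAMPLE, flags=re.DOTALL)

cleaned = strip_page_content(SAMPLE)
```

The lazy quantifier matters as much as the flag: with a greedy .* and the s flag on, a single match would run from the first opening tag to the last closing tag in the file.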

edited Aug 21 '13 at 15:50

answered Aug 21 '13 at 15:38

