Given that the following string is embedded in text, how can I extract the whole line without the match stopping at the inner "<" and ">"?

<test type="yippie<innertext>" />

EDIT:
To be more specific, we need to handle both cases below, where the "type" value may or may not contain "<" and ">" characters.

<h:test type="yippie<innertext>" />
<h:test type="yippie">

Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" '  -or-  ' type="yippie"'   (i.e., the remaining content before ">" or "/>")

So far, I have something like this, but it's a little off: Group 2 stops at the first ">". I am still tweaking the first part of Group 2's condition.

(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)
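To reproduce the behavior described above, here is a minimal sketch in Python (the choice of Python and the names `QUESTION_RE` and `sample` are assumptions for illustration; the pattern itself is unchanged). Note that the pattern wraps everything in an outer group, so the "Group 2" from the description is the third capture here.

```python
import re

# The pattern from the question, unchanged.
QUESTION_RE = re.compile(r'(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)')

sample = '<h:test type="yippie<innertext>" />'
m = QUESTION_RE.search(sample)

whole_match = m.group(1)  # ends at the first ">" inside the quoted value
tag_name = m.group(2)
attributes = m.group(3)   # cut short, missing the closing quote

print(repr(whole_match))
print(repr(tag_name))
print(repr(attributes))
```

The lazy alternative `[^>]*?` is allowed to cross the quote but not the first ">", which is why the match ends inside the attribute value.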

Thanks for your help.

cwall

2 Answers

2

Try this:

<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>

Example usage (Python):

>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
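As a rough sketch, the same pattern can be checked against both cases from the question. The names `TAG_RE` and `split_tag` are hypothetical helpers introduced only for this illustration.

```python
import re

# Pattern from this answer; TAG_RE is an illustrative name.
TAG_RE = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')

def split_tag(text):
    """Return (tag_name, attribute_text) for the first tag found, or None."""
    m = TAG_RE.search(text)
    return m.groups() if m else None

# Quoted value containing "<" and ">", self-closing tag.
first = split_tag('<h:test type="yippie<innertext>" />')
# Plain quoted value, ordinary opening tag.
second = split_tag('<h:test type="yippie">')
# No attributes at all: the pattern requires whitespace after the name.
bare = split_tag('<h:test>')

print(first)
print(second)
print(bare)
```

The quoted-string alternative `"[^"]*"` consumes the whole attribute value in one step, so any "<" or ">" inside the quotes never gets a chance to end the match. One limitation of the pattern as written: a tag with no attributes, such as `<h:test>`, does not match.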

Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.

Mark Byers
  • Yep, you're on it. I should have been more clear and complete. I need to group the matching splitting the tag name and the remaining lot. See above. – cwall Apr 23 '10 at 04:03
0

It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:

<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />

Furthermore, you probably need to escape the inner < and > as &lt; and &gt;. In well-formed XML, a literal < is not allowed inside an attribute value.
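A minimal sketch of that escaping, assuming you control how the markup is generated and can use Python's standard library. The element name `test` and the variable names are illustrative; the namespace prefix from the question is left out because an undeclared prefix would itself be rejected by an XML parser.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

raw_value = 'yippie<innertext>'

# quoteattr escapes &, < and > and wraps the result in quotes.
quoted = quoteattr(raw_value)
document = '<test type=%s />' % quoted

# Once the value is escaped, a real XML parser reads it back unchanged.
element = ET.fromstring(document)
round_tripped = element.get('type')

print(document)
print(round_tripped)
```

With the value escaped, no regex is needed to split the tag name from its attributes: the parser exposes them as `element.tag` and `element.attrib`.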

For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:

RegEx match open tags except XHTML self-contained tags

eaolson