
I need a regular expression to match XML start nodes like the following,

  1. normal cases <ref> and <ref name="gbtribune.files.wordpress.com">

  2. empty attribute <ref name="gbtribune.files.wordpress.com" name2> or <ref name="gbtribune.files.wordpress.com" name2= >

  3. missing quotes <ref name=gbtribune.files.wordpress.com> or <ref name="gbtribune.files.wordpress.com> or <ref name=gbtribune.files.wordpress.com">

but I do not want it to match self-closing nodes such as <ref/> or <ref name=gbtribune.files.wordpress.com" />

I also want the first group to capture the tag name, and the second group to capture all key-value attribute pairs.

My regex is designed as

<([a-zA-Z]+)\s*([^\/<>"=\s]+=?(?:(?:"(?:[^<>"]*)"?)|(?:[^=<>"\s]*"?))?\s*)*>

You can open it here https://regex101.com/r/TVwye1/3


It works for cases 1, 2, and 3, but it also matches self-closing nodes. I need help excluding the self-closing nodes from the matches.

Tony
  • Why not use an XML parser? Regular expressions aren't the best choice for XML/HTML parsing, since those languages aren't regular. – Cid Oct 14 '18 at 19:48
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Cid Oct 14 '18 at 19:48
  • @Cid I tested some of the top solutions in the previous question, none of them work for my case. – Tony Oct 14 '18 at 20:01
  • @Cid My application only allows for using regular expressions. – Tony Oct 14 '18 at 20:01
  • Add `(?!.*\/>$)` at the start of the [expression](https://regex101.com/r/TVwye1/4)? – Paolo Oct 14 '18 at 20:07
  • @UnbearableLightness Thanks for the help! It still has a problem making correct matches in strings like `txt `. The non-self-closing node in this example cannot be matched. – Tony Oct 14 '18 at 21:06
  • Your name portion is missing some [allowed characters](https://www.w3.org/TR/xml/#NT-NameStartChar). – Tom Blodget Jun 03 '19 at 23:52

1 Answer


You may use

<(?![^<>]*\/\s*>)([a-zA-Z]+)(?:\s+[^\/<>"=\s]+(?:=(?:"[^<>"]*"?|[^=<>"\s]*"?)?)?)*>

See the regex demo

Details

  • < - < char
  • (?![^<>]*\/\s*>) - a negative lookahead that fails the match if after the current location, there are any 0+ chars other than < and > followed with /, 0+ whitespaces and >
  • ([a-zA-Z]+) - Group 1: one or more ASCII letters
  • (?:\s+[^\/<>"=\s]+(?:=(?:"[^<>"]*"?|[^=<>"\s]*"?)?)?)* - 0 or more repetitions of:
    • \s+ - 1+ whitespaces
    • [^\/<>"=\s]+ - 1+ chars other than /, <, >, ", = and whitespace
    • (?:=(?:"[^<>"]*"?|[^=<>"\s]*"?)?)? - an optional sequence of:
      • = - an equal sign
      • (?:"[^<>"]*"?|[^=<>"\s]*"?)? - an optional sequence of:
        • "[^<>"]*"?| - a ", then 0 or more chars other than <, > and ", and then an optional ", or
        • [^=<>"\s]*"? - 0 or more chars other than =, <, >, " and whitespace and then an optional "
  • > - a > char.
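
As a quick sanity check, the pattern can be run against the sample tags from the question. The sketch below assumes Python's `re` module (the question does not say which regex flavor the application uses), and the names `START_TAG`, `tag_name`, `start_tags` and `self_closing` are made up for this illustration.

```python
import re

# The pattern from this answer, copied verbatim and split over two lines.
START_TAG = re.compile(
    r'<(?![^<>]*\/\s*>)([a-zA-Z]+)'
    r'(?:\s+[^\/<>"=\s]+(?:=(?:"[^<>"]*"?|[^=<>"\s]*"?)?)?)*>'
)


def tag_name(text):
    """Return the tag name (Group 1) of the first start tag found, or None."""
    m = START_TAG.search(text)
    return m.group(1) if m else None


# Samples taken from the question that should be treated as start tags.
start_tags = [
    '<ref>',
    '<ref name="gbtribune.files.wordpress.com">',
    '<ref name="gbtribune.files.wordpress.com" name2>',
    '<ref name=gbtribune.files.wordpress.com>',
    '<ref name="gbtribune.files.wordpress.com>',
    '<ref name=gbtribune.files.wordpress.com">',
]

# Self-closing samples from the question that should be rejected.
self_closing = [
    '<ref/>',
    '<ref name=gbtribune.files.wordpress.com" />',
]

for s in start_tags + self_closing:
    print(f"{s!r:60} -> {tag_name(s)}")
```

One caveat: as written, the pattern expects `>` immediately after the last attribute, so the `name2= >` sample from the question (with a space before `>`) is not matched in this check. Adding `\s*` before the final `>` would likely cover it, but that change is not tested here.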
Wiktor Stribiżew