From the tags below, I want to select only specific tags (tagA|tagB) that don't have an "id" attribute, using a regex.

<span class="online"><tagA xmlns="http://www.xyz.com/xml/ja/dtd">A1</tagA><tagB id="tg1" xmlns="http://www.xyz.com/xml/ja/dtd">B1</tagB></span>
<span class="online"><tagA id="tg2" xmlns="http://www.xyz.com/xml/ja/dtd">A2</tagA><tagB xmlns="http://www.xyz.com/xml/ja/dtd">B2</tagB></span>
<tagA id="tg3" xmlns="http://www.xyz.com/xml/ja/dtd">A3</tagA>
<tagB id="tg4" xmlns="http://www.xyz.com/xml/ja/dtd">B3</tagB>
<tagC id="tg5" xmlns="http://www.xyz.com/xml/ja/dtd">C1</tagC>
<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A4</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B4</tagB>
<tagC xmlns="http://www.xyz.com/xml/ja/dtd">C2</tagC>
<tagA>A5</tagA>
<tagB>B5</tagB>
<tagC>C3</tagC>
<span class="online"><i><tagA xmlns="http://www.xyz.com/xml/ja/dtd">A6</tagA></i><b><tagB id="tg6" xmlns="http://www.xyz.com/xml/ja/dtd">B6</tagB></b></span>
<span class="online"><i><tagA id="tg7" xmlns="http://www.xyz.com/xml/ja/dtd">A7</tagA></i><b><tagB xmlns="http://www.xyz.com/xml/ja/dtd">B7</tagB></b></span>

As a result, I should get only the following:

<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A1</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B2</tagB>

<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A4</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B4</tagB>
<tagA>A5</tagA>
<tagB>B5</tagB>

<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A6</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B7</tagB>
pks
  • Are non-regexp solutions acceptable? – Álvaro González Nov 23 '12 at 13:26
  • XML parsing with regexps is often not a good idea as XML is not a regular language and therefore can't be parsed with regular expressions. See [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for the results of using regexps for XML parsing. Having said that, for very limited and well-defined cases like this one, it would probably work as in the answer below. – Inductiveload Nov 23 '12 at 13:29
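
To illustrate the non-regexp route raised in the comments, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes the input is well-formed once wrapped in a single root element; the names SAMPLE, elements_without_id and the wrapper element root are hypothetical and not part of the question.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed excerpt of the question's data (assumption: the
# real input can be wrapped in one root element and parsed as XML).
SAMPLE = (
    '<span class="online">'
    '<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A1</tagA>'
    '<tagB id="tg1" xmlns="http://www.xyz.com/xml/ja/dtd">B1</tagB>'
    '</span>'
    '<tagA id="tg3" xmlns="http://www.xyz.com/xml/ja/dtd">A3</tagA>'
    '<tagB>B5</tagB>'
    '<tagC>C3</tagC>'
)

def elements_without_id(fragment, wanted=("tagA", "tagB")):
    """Return (name, text) for each wanted element that has no id attribute."""
    root = ET.fromstring("<root>" + fragment + "</root>")
    found = []
    for elem in root.iter():
        # ElementTree reports namespaced tags as "{namespace}name".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name in wanted and "id" not in elem.attrib:
            found.append((local_name, elem.text))
    return found

print(elements_without_id(SAMPLE))
```

A parser handles nesting and attribute order without extra effort, but it rejects malformed input outright.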

3 Answers

This regex matches the tags even when they are nested:

<(?!.*?\sid=)(.*?)(\s+.*?)?>.*?</\1>

.*? matches zero or more characters lazily.

(?!.*?\sid=) is a negative lookahead: it checks whether an id attribute follows and, if one does, the match fails at that position.

Anything matched within () is captured in a group.

\1 refers to the first captured group, which here is the tag name.
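
The lookahead and the backreference can be tried out with a minimal Python sketch. As assumptions that go beyond the pattern above, the lookahead is limited to the opening tag with [^>]* (an unbounded .*? also sees id= attributes of later tags on the same line) and the tag name is restricted to tagA or tagB. The names PATTERN and find_tags_without_id are hypothetical.

```python
import re

# Assumed variant of the pattern above: the lookahead uses [^>]* so it stops
# at the end of the opening tag, and the name is limited to tagA or tagB.
PATTERN = re.compile(r'<(?![^>]*\sid=)(tagA|tagB)(\s[^>]*)?>.*?</\1>')

def find_tags_without_id(text):
    """Return every tagA/tagB element whose opening tag has no id attribute."""
    return [m.group(0) for m in PATTERN.finditer(text)]

line = ('<span class="online">'
        '<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A1</tagA>'
        '<tagB id="tg1" xmlns="http://www.xyz.com/xml/ja/dtd">B1</tagB>'
        '</span>')
print(find_tags_without_id(line))
```

Here \1 forces the closing tag to carry the same name as the opening tag, so tagA is never closed by </tagB>.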

Anirudha

Use this regex pattern:

<(\S+)(?![^<>]*\bid=).*?<\/\1>

Ωmega

Here's how I would do it:

/<(tag[A-Z]+)(?:\s+(?!id=)\w+="[^"]+")*>\w+<\/\1>/i

Breaking it down:

  • <(tag[A-Z]+) matches the opening tag and captures its name in group #1

  • (?:\s+(?!id=)\w+="[^"]+")* consumes the attributes one at a time, after checking that the attribute's name is not id

  • >\w+</\1> finishes off the opening tag, then consumes the content and the closing tag

You may need to tweak parts of it, especially the \w+ sequences. Lacking familiarity with your data, I tossed those in to serve as placeholders.
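
As a rough illustration of the breakdown above, the following Python sketch runs the pattern over a few lines from the question. One tweak is an assumption on my part: tag[A-Z]+ is narrowed to tag[AB] so that tagC is skipped, as the question asks. The names ATTR_PATTERN, match_without_id and sample are hypothetical.

```python
import re

# The pattern from this answer, with tag[A-Z]+ narrowed to tag[AB]
# (assumption: only tagA and tagB are wanted).
ATTR_PATTERN = re.compile(
    r'<(tag[AB])(?:\s+(?!id=)\w+="[^"]+")*>\w+</\1>',
    re.IGNORECASE,
)

def match_without_id(text):
    """Return the full text of each matching element."""
    return [m.group(0) for m in ATTR_PATTERN.finditer(text)]

sample = "\n".join([
    '<span class="online"><i><tagA xmlns="http://www.xyz.com/xml/ja/dtd">A6</tagA></i>'
    '<b><tagB id="tg6" xmlns="http://www.xyz.com/xml/ja/dtd">B6</tagB></b></span>',
    '<tagB id="tg4" xmlns="http://www.xyz.com/xml/ja/dtd">B3</tagB>',
    '<tagA>A5</tagA>',
    '<tagC>C3</tagC>',
    '<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B4</tagB>',
])

for element in match_without_id(sample):
    print(element)
```

Because each attribute is consumed individually, an id attribute anywhere in the opening tag stops the match, while id= in a later tag on the same line has no effect.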

Alan Moore