Regex to select only specific tags and exclude those that have a particular attribute

Question

From the tags below, I want to select only specific tags (tagA|tagB) that do not have an "id" attribute, using a regex.

<span class="online"><tagA xmlns="http://www.xyz.com/xml/ja/dtd">A1</tagA><tagB id="tg1" xmlns="http://www.xyz.com/xml/ja/dtd">B1</tagB></span>
<span class="online"><tagA id="tg2" xmlns="http://www.xyz.com/xml/ja/dtd">A2</tagA><tagB xmlns="http://www.xyz.com/xml/ja/dtd">B2</tagB></span>
<tagA id="tg3" xmlns="http://www.xyz.com/xml/ja/dtd">A3</tagA>
<tagB id="tg4" xmlns="http://www.xyz.com/xml/ja/dtd">B3</tagB>
<tagC id="tg5" xmlns="http://www.xyz.com/xml/ja/dtd">C1</tagC>
<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A4</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B4</tagB>
<tagC xmlns="http://www.xyz.com/xml/ja/dtd">C2</tagC>
<tagA>A5</tagA>
<tagB>B5</tagB>
<tagC>C3</tagC>
<span class="online"><i><tagA xmlns="http://www.xyz.com/xml/ja/dtd">A6</tagA></i><b><tagB id="tg6" xmlns="http://www.xyz.com/xml/ja/dtd">B6</tagB></b></span>
<span class="online"><i><tagA id="tg7" xmlns="http://www.xyz.com/xml/ja/dtd">A7</tagA></i><b><tagB xmlns="http://www.xyz.com/xml/ja/dtd">B7</tagB></b></span>

As a result, I should get only these tags:

<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A1</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B2</tagB>

<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A4</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B4</tagB>
<tagA>A5</tagA>
<tagB>B5</tagB>

<tagA xmlns="http://www.xyz.com/xml/ja/dtd">A6</tagA>
<tagB xmlns="http://www.xyz.com/xml/ja/dtd">B7</tagB>

XML parsing with regexps is often not a good idea as XML is not a regular language and therefore can't be parsed with regular expressions. See [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for the results of using regexps for XML parsing. Having said that, for very limited and well-defined cases like this one, it would probably work as in the answer below. — Inductiveload, Nov 23 '12 at 13:29
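For comparison with the regex answers, here is a minimal parser-based sketch of the same selection in Python. It assumes the fragment is well-formed and can be wrapped in a single root element; the `root` wrapper, the `local_name` helper, and the shortened sample are illustrative assumptions, not part of the question.

```python
import xml.etree.ElementTree as ET

NS = ' xmlns="http://www.xyz.com/xml/ja/dtd"'

# Shortened sample in the shape of the question's data (assumed well-formed).
fragment = (
    f'<span class="online"><tagA{NS}>A1</tagA><tagB id="tg1"{NS}>B1</tagB></span>'
    f'<tagA id="tg3"{NS}>A3</tagA>'
    f'<tagC{NS}>C2</tagC>'
    '<tagB>B5</tagB>'
)

# Wrap in a hypothetical root so the parser sees a single document element.
root = ET.fromstring(f'<root>{fragment}</root>')

def local_name(tag):
    # ElementTree reports namespaced tags as "{uri}name"; keep only the name.
    return tag.rsplit('}', 1)[-1]

# Keep tagA/tagB elements at any depth that carry no id attribute.
selected = [
    (local_name(el.tag), el.text)
    for el in root.iter()
    if local_name(el.tag) in ('tagA', 'tagB') and 'id' not in el.attrib
]
print(selected)  # [('tagA', 'A1'), ('tagB', 'B5')]
```

Because the parser tracks element boundaries itself, nesting depth and sibling attributes do not affect the result.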

Anirudha · Answer 1 · 2012-11-23T15:06:00.097

2

This regex matches the tag even when it is nested inside another element:

<(?!.*?\sid=)(.*?)(\s+.*?)?>.*?</\1>

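.*? matches zero or more characters lazily (as few as possible).

(?!.*?\sid=) is a negative lookahead: the match fails if whitespace followed by id= appears anywhere in the rest of the line (by default . does not match newlines), so a tag is also rejected when a later tag on the same line has an id attribute.

Anything matched within () is captured in a group.

\1 refers back to the first captured group, normally the tag name.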
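The pieces described above can be tried out in Python. This is only a sketch: the pattern is copied unchanged from the answer, while the `first_match` helper and the three shortened inputs are assumptions made for illustration.

```python
import re

# Pattern from the answer, unchanged.
PATTERN = re.compile(r'<(?!.*?\sid=)(.*?)(\s+.*?)?>.*?</\1>')

def first_match(line):
    """Return the first matched tag in the line, or None if nothing matches."""
    m = PATTERN.search(line)
    return m.group(0) if m else None

# A tag without attributes: the backreference \1 pairs <tagA> with </tagA>.
plain = first_match('<tagA>A5</tagA>')

# The tag's own id attribute trips the negative lookahead.
with_id = first_match('<tagA id="tg3">A3</tagA>')

# The lookahead scans the rest of the line, so a sibling's id also blocks tagA.
sibling = first_match('<tagA>A1</tagA><tagB id="tg1">B1</tagB>')

print(plain)    # <tagA>A5</tagA>
print(with_id)  # None
print(sibling)  # None
```

The last case shows why the lookahead needs to be confined to a single tag when several tags share one line, as in the question's span examples.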


edited Nov 23 '12 at 15:06

answered Nov 23 '12 at 13:26


Watch out for nested tags: this pattern applied to `DellInc.` will match only `DellInc.`. – Inductiveload Nov 23 '12 at 13:40
Yep, this will allow different nested tags. But it will still break on `DellInc.`. – Inductiveload Nov 23 '12 at 13:50
How about `text` ? – Ωmega Nov 23 '12 at 14:59
@Ωmega your solution is the best. I want to delete my answer but don't want to lose the +30 rep ;) – Anirudha Nov 23 '12 at 15:07
Thanks for your reply, but I've made a small change to the requirement. I've updated the original question... – pks Nov 26 '12 at 10:21

Ωmega · Answer 2 · 2012-11-23T14:02:12.983

1

Use this regex pattern:

<(\S+)(?![^<>]*\bid=).*?<\/\1>

edited Nov 23 '12 at 14:02

answered Nov 23 '12 at 13:51


Thanks for your reply, but I've made a small change to the requirement. I've updated the original question... – pks Nov 26 '12 at 10:22
@pks - Then go with `<(tag[AB])(?![^<>]*\bid=).*?<\/\1>` – Ωmega Nov 26 '12 at 12:57
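A quick way to check the `tag[AB]` variant is to run it over data shaped like the question's sample. The sketch below uses Python's `re` module; the `NS` and `SAMPLE` names and the shortened input are assumptions for illustration, and the tag-stripping step exists only to keep the printed result short.

```python
import re

NS = ' xmlns="http://www.xyz.com/xml/ja/dtd"'

# Shortened sample in the shape of the question's data.
SAMPLE = "\n".join([
    f'<span class="online"><tagA{NS}>A1</tagA><tagB id="tg1"{NS}>B1</tagB></span>',
    f'<span class="online"><tagA id="tg2"{NS}>A2</tagA><tagB{NS}>B2</tagB></span>',
    f'<tagA id="tg3"{NS}>A3</tagA>',
    f'<tagC{NS}>C2</tagC>',
    '<tagA>A5</tagA>',
    '<tagB>B5</tagB>',
    '<tagC>C3</tagC>',
    f'<span class="online"><i><tagA{NS}>A6</tagA></i><b><tagB id="tg6"{NS}>B6</tagB></b></span>',
])

# [^<>]* keeps the lookahead inside the opening tag, so a sibling's id is ignored.
PATTERN = re.compile(r'<(tag[AB])(?![^<>]*\bid=).*?<\/\1>')

matches = [m.group(0) for m in PATTERN.finditer(SAMPLE)]

# Strip the markup to show only the text content of each matched tag.
contents = [re.sub(r'<[^>]+>', '', m) for m in matches]
print(contents)  # ['A1', 'B2', 'A5', 'B5', 'A6']
```

Only tagA and tagB elements without an id survive, including those nested inside span, i, or b.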

score 0 · Answer 3 · answered Nov 26 '12 at 11:29

Here's how I would do it:

/<(tag[A-Z]+)(?:\s+(?!id=)\w+="[^"]+")*>\w+<\/\1>/i

Breaking it down:

<(tag[A-Z]+) matches the opening tag and captures its name in group #1
(?:\s+(?!id=)\w+="[^"]+")* consumes the attributes one at a time, after checking that the attribute's name is not id
>\w+</\1> finishes off the opening tag, then consumes the content and the closing tag

You may need to tweak parts of it, especially the \w+ sequences. Lacking familiarity with your data, I tossed those in to serve as placeholders.
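The breakdown above maps naturally onto a verbose-mode pattern in Python. This is a sketch rather than the answer's exact regex: the tag name is narrowed from `tag[A-Z]+` to `tag[AB]` to fit the question's requirement, and the `ATTR_WALK` name and the sample lines are assumptions for illustration.

```python
import re

ATTR_WALK = re.compile(r'''
    <(tag[AB])                    # opening tag name, captured as group 1
    (?:\s+(?!id=)\w+="[^"]+")*    # attributes one at a time, none named id
    >\w+</\1>                     # end of opening tag, content, closing tag
''', re.VERBOSE | re.IGNORECASE)

NS = ' xmlns="http://www.xyz.com/xml/ja/dtd"'
lines = [
    f'<tagA{NS}>A4</tagA>',
    f'<tagB id="tg4"{NS}>B3</tagB>',
    '<tagA>A5</tagA>',
    '<tagC>C3</tagC>',
    f'<span class="online"><tagA{NS}>A1</tagA><tagB id="tg1"{NS}>B1</tagB></span>',
]

# Collect every matched tag, line by line.
found = [m.group(0) for line in lines for m in ATTR_WALK.finditer(line)]

# Reduce each match to its text content for a compact printout.
texts = [re.sub(r'<[^>]+>', '', f) for f in found]
print(texts)  # ['A4', 'A5', 'A1']
```

Because each attribute is consumed explicitly, an id in any position stops the match, but attribute names or content outside `\w` (for example a namespaced attribute) would need the placeholders adjusted.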
