I want a regex to extract the details of a specific HTML tag.

I tried the two regexes below:

<\s*tag[^>]*>(.*?)<\s*/\s*tag>

<tag[^<>]*>.+?<\/tag>

Below are two test cases for the first regex.

In Example 1 I get the correct result, but in Example 2 I get the wrong result, even though the inputs are almost the same.

In the first case each tag is on its own line; in the second case everything is a single string.

===================================
Example 1 Input
===================================
<tagX>AAA</tagX>
<tag>GGG</tag>
<tag id="tag896">HHH</tag>
<tagY>III</tagY>
<tag id="tag017">JJJ</tag>
<tag>KKK</tag>
===================================
Output 1 // Correct
===================================
<tag>GGG</tag>
GGG
<tag id="tag896">HHH</tag>
HHH
<tag id="tag017">JJJ</tag>
JJJ
<tag>KKK</tag>
KKK


===================================
Example 2 Input (as a single string)
===================================
<tagX>AAA</tagX><tag>GGG</tag><tag id="tag896">HHH</tag><tagY>III</tagY><tag id="tag017">JJJ</tag><tag>KKK</tag>
===================================
Output 2 // Wrong
===================================
<tagX>AAA</tagX><tag>GGG</tag>
AAA</tagX><tag>GGG

<tag id="tag896">HHH</tag>
HHH

<tagY>III</tagY><tag id="tag017">JJJ</tag>
III</tagY><tag id="tag017">JJJ

<tag>KKK</tag>
KKK

Here I want only the details of (tag), but in the second case it fetches the details of (tag) + (tagX) + (tagY).

My input is similar to the second input.

It's a little urgent. Can I get a solution for this?

Thanks.

pks

3 Answers

1

The problem with the regular expressions you've written is that they allow <tagX> (for example) to be the opening tag if there is a </tag> on the same line that supposedly closes it.
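A minimal Python sketch of that behaviour, using the first pattern from the question on a shortened version of its input. The variable names are only illustrative, and Python's `re` module is an assumption here since the question does not name a regex engine:

```python
import re

# First pattern from the question: [^>]* also consumes the "X" in <tagX>,
# so <tagX> is accepted as an opening tag.
pattern = re.compile(r"<\s*tag[^>]*>(.*?)<\s*/\s*tag>")

single_line = '<tagX>AAA</tagX><tag>GGG</tag><tag id="tag896">HHH</tag>'
first = pattern.search(single_line)
print(first.group(0))  # <tagX>AAA</tagX><tag>GGG</tag>
print(first.group(1))  # AAA</tagX><tag>GGG

# With one tag per line, "." cannot cross the newline, so the match that
# starts at <tagX> fails and only the real <tag> element is found.
multi_line = "<tagX>AAA</tagX>\n<tag>GGG</tag>"
per_line = [m.group(0) for m in pattern.finditer(multi_line)]
print(per_line)  # ['<tag>GGG</tag>']
```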

Another problem with using regular expressions in this case is that you might get a bad result if the XML is:

<tag></tag>
<tagX></tagX>
<tag></tag>

If all these tags are on one line, a single match could swallow the whole thing, so be very careful.

I'd work with something like this (it works with the above example):

 <\s*tag((\s+[^<>]+\s*>)|(\s*>))[^<>]*<\s*\/tag\s*>

Here I allow all the whitespace that is valid, but I don't allow nested tags, so the above example will work. Moreover, if you allow nested tags, no regex will work. Look at this example:

<tag> <tagX> <tag> </tag> </tagX> </tag>

In this example, though, you would get <tag> <tagX> <tag> </tag> as a valid match, which pairs the outer opening tag with the inner closing tag.
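As a rough check, the sketch below runs the pattern from this answer against the single-string input from the question and against the nested example. It assumes Python's `re` module; the `lazy` pattern is a simplified stand-in added only to illustrate the mismatched pairing, and is not taken from the question:

```python
import re

# Pattern from this answer: no "<" or ">" is allowed between the tags.
strict = re.compile(r"<\s*tag((\s+[^<>]+\s*>)|(\s*>))[^<>]*<\s*\/tag\s*>")

text = ('<tagX>AAA</tagX><tag>GGG</tag><tag id="tag896">HHH</tag>'
        '<tagY>III</tagY><tag id="tag017">JJJ</tag><tag>KKK</tag>')
matches = [m.group(0) for m in strict.finditer(text)]
for match in matches:
    print(match)

nested = "<tag> <tagX> <tag> </tag> </tagX> </tag>"

# The strict pattern only finds the innermost element.
inner = [m.group(0) for m in strict.finditer(nested)]
print(inner)  # ['<tag> </tag>']

# A simple lazy pattern (hypothetical, for illustration) stops at the
# first closing tag, pairing the outer <tag> with the inner </tag>.
lazy = re.compile(r"<tag>.*?</tag>")
mismatched = lazy.search(nested).group(0)
print(mismatched)  # <tag> <tagX> <tag> </tag>
```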

EZLearner
1

I tried the regex below and it's working fine:

<tag( [^<>]+)?>(.+?)<\/tag>
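For reference, a small sketch of how this pattern behaves on the single-string input from the question, assuming Python's `re` module (the variable names are illustrative):

```python
import re

# Pattern from this answer: after "<tag" only ">" or " attributes>" may follow,
# so <tagX> and <tagY> are rejected as opening tags.
pattern = re.compile(r"<tag( [^<>]+)?>(.+?)<\/tag>")

text = ('<tagX>AAA</tagX><tag>GGG</tag><tag id="tag896">HHH</tag>'
        '<tagY>III</tagY><tag id="tag017">JJJ</tag><tag>KKK</tag>')

# Group 0 is the whole element, group 2 is the text between the tags.
elements = [m.group(0) for m in pattern.finditer(text)]
contents = [m.group(2) for m in pattern.finditer(text)]
print(elements)
print(contents)  # ['GGG', 'HHH', 'JJJ', 'KKK']
```

Note that `.+?` requires at least one character, so an empty `<tag></tag>` is not matched on its own.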
slfan
pks
0

If you are using .NET (and for some reason, you are sure about your XML and don't need to use Html Agility Pack), you may try this:

<tag(?:>|(?: .*?>))(.*?)</tag>
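If a parser is an option after all, the same extraction can be sketched without a regex. The example below is only an illustration in Python using the standard library's `html.parser` (not Html Agility Pack, which is a .NET library); the `TagCollector` class and its attribute names are hypothetical:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the text inside every element with the given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name      # element name to look for, e.g. "tag"
        self.depth = 0        # how many matching elements are currently open
        self.results = []     # text of each completed outermost element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases names, so <tagX> arrives as "tagx" and is ignored.
        if tag == self.name:
            if self.depth == 0:
                self._buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


text = ('<tagX>AAA</tagX><tag>GGG</tag><tag id="tag896">HHH</tag>'
        '<tagY>III</tagY><tag id="tag017">JJJ</tag><tag>KKK</tag>')
collector = TagCollector("tag")
collector.feed(text)
collector.close()
print(collector.results)  # ['GGG', 'HHH', 'JJJ', 'KKK']
```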
Alex Filipovici