0

I have some text in which I would like to match a tag only when it appears exactly once. The text is shown below (the random characters can be anything except tags):

<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>

What I want is to match a tag2 that contains tag3 exactly once.

For example:

<tag2><tag3>something</tag3></tag2> is matched
<tag2><tag3>something</tag3><tag3>something</tag3></tag2> isn't matched

Based on the text above, the expected output is lines 2 and 5.

The regexes I tried (neither worked):

<tag2><tag3>(.*)?</tag3></tag2>
<tag2><tag3>(.*){1}</tag3></tag2>
dellair
  • 3
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Lucas Trzesniewski Jun 16 '16 at 07:44
  • Is this XML, HTML or something that just looks like it? The duplicate may be valid, but there are cases where a pragmatic regex is fine. – simbabque Jun 16 '16 at 07:50
  • I don't know if it matters, the tags can be any special characters, so it is a regex question to me. This is XML btw. :) – dellair Jun 16 '16 at 07:51
  • Also, are we talking PCRE regex, or do you have an actual Perl program? – simbabque Jun 16 '16 at 07:51
  • It does matter. `XML` is a contextual language. Regex cannot do context, so it's a terrible solution. – Sobrique Jun 16 '16 at 07:59
  • Thanks all above! simbabque has the right answer. – dellair Jun 16 '16 at 08:04
  • 2
    @Sobrique if it's a pattern as simple as this, and the format of the input file is always the same, a regex-based solution is totally fine in my opinion. Sometimes it's ok to be pragmatic. – simbabque Jun 16 '16 at 08:05
  • If it were not trivial to parse as `XML` using `xpath` I might agree. As is, this is building brittle code. – Sobrique Jun 16 '16 at 09:11

3 Answers

4

I would urge you never to use regular expressions to manipulate XML. Regular expressions cannot handle a contextual language like XML, so you end up with brittle code that a perfectly valid change to the XML formatting (such as different whitespace) can break.

So instead:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

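# Parse the XML document from the __DATA__ section below.
my $twig = XML::Twig->parse( \*DATA );

# Visit every <tag2> and print only those with exactly one <tag3> child.
foreach my $element ( $twig->get_xpath('//tag2') ) {
   if ( scalar $element->children('tag3') == 1 ) {
      $element->print;
      print "\n";
   }
}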

__DATA__
<root>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
<tag1><tag2><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3><tag3>Some randome chars</tag3></tag2></tag1>
</root>

This handles XML formatted like yours, XML that sits entirely on a single line, or XML laid out like this:

<root>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
      <tag3>Some randome chars</tag3>
    </tag2>
  </tag1>
</root>

Or like this:

<root
><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1><tag1
><tag2
><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3><tag3
>Some randome chars</tag3></tag2></tag1></root>

Both of these are semantically identical to yours.

Sobrique
  • Thanks Sobrique, this may also work, but my real case was a bit more complicated and it wasn't some kind of translated XML mapping. – dellair Jun 16 '16 at 16:19
  • 1
    You said in the comments that you were working with XML. If you are, then as complication increases, the case for an XML parser does too. – Sobrique Jun 16 '16 at 16:36
2

Your regex didn't work because the . in your capture group matches any character, including the characters that make up tags. The .* is also greedy: it goes as far as possible and only stops at the last </tag3>. If you want to match only content that cannot include tags, you need to match anything but the character that opens a tag, <.

m{<tag2><tag3>([^<]+)</tag3></tag2>}g

Try it on regex101.com.
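
To make the difference concrete, here is a minimal sketch that runs the question's first attempt and the negated character class over the eight sample lines. It assumes one <tag1> record per line, as in the question; the helper name matching_lines and the way the sample lines are rebuilt from their <tag3> counts are illustrative choices, not part of the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Number of <tag3> elements on each of the question's eight sample lines.
my @counts = (4, 1, 3, 2, 1, 6, 5, 4);

# Rebuild the sample lines from those counts.
my @lines;
for my $n (@counts) {
    push @lines, '<tag1><tag2>' . ('<tag3>Some randome chars</tag3>' x $n) . '</tag2></tag1>';
}

# Hypothetical helper: return the 1-based numbers of the lines matched by $re.
sub matching_lines {
    my ($re, @input) = @_;
    return grep { $input[$_ - 1] =~ $re } 1 .. @input;
}

my $greedy  = qr{<tag2><tag3>(.*)?</tag3></tag2>};    # first attempt from the question
my $negated = qr{<tag2><tag3>([^<]+)</tag3></tag2>};  # content may not contain '<'

print "greedy:  @{[ matching_lines($greedy,  @lines) ]}\n";  # greedy:  1 2 3 4 5 6 7 8
print "negated: @{[ matching_lines($negated, @lines) ]}\n";  # negated: 2 5
```

The greedy pattern matches every line because .* happily swallows the intermediate </tag3><tag3> pairs. The negated class cannot cross a <, so only the lines with a single tag3 match.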

simbabque
  • 1
    Cheers, @simbabque, you are absolutely right. And I also worked out a solution: ((?!tag).)*<\/tag3><\/tag2> – dellair Jun 16 '16 at 08:03
1

Use an XML-aware tool. I tried the following in xsh, a wrapper around XML::LibXML:

ls //tag2[1=count(tag3)]

After adding line numbers to the tag2 elements, I got:

<tag2>2<tag3>Some randome chars</tag3></tag2>
<tag2>5<tag3>Some randome chars</tag3></tag2>
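
The same XPath expression can presumably be used from a plain Perl script by calling XML::LibXML directly. The following is only a sketch under some assumptions: the <root> wrapper element and the shortened sample data are made up for illustration, since the question's text has no single root element.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Assumed input: two records wrapped in a hypothetical <root> element.
my $xml = '<root>'
        . '<tag1><tag2><tag3>a</tag3><tag3>b</tag3></tag2></tag1>'
        . '<tag1><tag2><tag3>c</tag3></tag2></tag1>'
        . '</root>';

my $doc = XML::LibXML->load_xml( string => $xml );

# Select every <tag2> that has exactly one <tag3> child.
my @single = $doc->findnodes('//tag2[count(tag3) = 1]');

print $_->toString, "\n" for @single;
# prints: <tag2><tag3>c</tag3></tag2>
```

Because the count is done on the parsed tree, the result does not depend on how the XML is split across lines.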
choroba