
I need to parse an XML file without using a module.

In that XML file I need to extract all content between 2 tags (<mi>...</mi>) that match a pattern.

I have this:

my $xmlstring = "...";   # my XML string
my $pattern = "G2_CPU";
my $regex = "<mi>(.*?" . $pattern . ".*?)<\\/mi>";
my ($data) = $xmlstring =~ /$regex/i;

But when I execute it, $data contains everything between the very first <mi> tag and the very last </mi> tag.

I also tried the regex without a variable: /(<mi>.*?G2_CPU.*?<\/mi>)/ and got the same result.

How can I do it?

fedorqui
Maximilien
  • You have `$compteur` in the regex and you talk about `$pattern`. Are you sure you have the proper variable naming? – fedorqui Mar 06 '15 at 12:50
  • Sorry, wrong cut/paste, please replace `$pattern = "G2_CPU"` by `$compteur = "G2_CPU"` – Maximilien Mar 06 '15 at 12:50
  • XML is not a regex parsable structure. It's very hazardous to go down this road, because it'll create brittle code that'll mysteriously break one day. – Sobrique Mar 06 '15 at 13:25
  • @Maximilien: What are your reasons for avoiding a proper XML parser? You really can't do this reliably using regular expressions. – Borodin Mar 06 '15 at 14:13
  • This script will be run on a piece of telecom equipment on which I cannot install Perl modules – Maximilien Mar 06 '15 at 15:28
  • If you can put your script on said equipment, you can also put at least any non-XS module on said equipment. – Sinan Ünür Mar 06 '15 at 16:06

2 Answers

Assuming this is still valid XML, i.e. < cannot appear between tag open and tag close, and there is no CDATA within those tags, you can just use:

my $re = qr{<mi>([^<]*? \Q$pattern\E [^<]*?)</mi>}ix;

That is, instead of allowing any character up to the substring of interest, allow just non-tag opening characters.
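As a minimal sketch of how this could be used to collect every matching element rather than just the first one, the regex can be applied with the /g modifier in list context. The helper name mi_matching and the sample data are made up for illustration, and the same assumptions apply (no nested tags, no CDATA inside <mi>):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the content of every <mi> element
# whose text contains $pattern (literal, case-insensitive).
# Call it in list context to get all captures.
sub mi_matching {
    my ($xml, $pattern) = @_;
    # [^<]* cannot cross a tag boundary, so each match stays
    # inside a single <mi>...</mi> pair.
    my $re = qr{<mi>([^<]*? \Q$pattern\E [^<]*?)</mi>}ix;
    return $xml =~ /$re/g;
}

# Made-up sample data, not taken from the question.
my $xml = '<mi>G1_MEM</mi><mi>G2_CPU_LOAD</mi><mi>x G2_CPU y</mi>';
print "$_\n" for mi_matching($xml, 'G2_CPU');
# prints:
# G2_CPU_LOAD
# x G2_CPU y
```

The \Q...\E pair keeps any regex metacharacters in the pattern literal, which matters if the counter name ever contains characters such as "." or "+".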

Also, if I ever did go down the rabbit hole of parsing XML without a decent XML parser, my first instinct would be to extract the text between <mi>...</mi> first and then check whether it contains what I am looking for.
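That two-step idea might look roughly like the following sketch. The helper name mi_containing and the sample string are hypothetical, and it still assumes that <mi> elements contain plain text only:

```perl
use strict;
use warnings;

# Hypothetical helper: extract first, filter second.
sub mi_containing {
    my ($xml, $pattern) = @_;
    # Step 1: grab the text of every <mi> element
    # (assumes no nested tags or CDATA inside <mi>).
    my @contents = $xml =~ m{<mi>([^<]*)</mi>}gi;
    # Step 2: keep only the entries containing the literal pattern.
    return grep { /\Q$pattern\E/i } @contents;
}

# Made-up sample data for illustration.
my $xml = '<mi>G1_MEM</mi> text <mi>G2_CPU_LOAD</mi>';
print "$_\n" for mi_containing($xml, 'G2_CPU');
# prints:
# G2_CPU_LOAD
```

Separating the two steps keeps each regex simple and makes it easy to change the filter later without touching the extraction.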

Sinan Ünür

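You just need to add a greedy match at the beginning of the pattern. It consumes as much of the string as possible, so the match settles on the last <mi> from which the rest of the pattern can still succeed: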

my $regex = "(?:.*)<mi>(.*?" . $compteur . ".*?)<\/mi>";
             ^^^^^^

From Shortest match issues:

The problem is that even with non-greedy matching, Perl is still trying to find the match that starts at the leftmost possible point in the string.
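A small sketch of that leftmost rule, using a made-up sample string and a hypothetical helper name (lazy_capture): even though both quantifiers are non-greedy, the match begins at the first <mi> and simply stretches until it reaches the keyword.

```perl
use strict;
use warnings;

# Hypothetical helper showing the original, purely non-greedy regex.
sub lazy_capture {
    my ($xml) = @_;
    my ($captured) = $xml =~ /<mi>(.*?second.*?)<\/mi>/i;
    return $captured;
}

my $xml = 'hello <mi>first mi</mi> and this is another <mi>second mi</mi> end.';
print lazy_capture($xml), "\n";
# prints:
# first mi</mi> and this is another <mi>second mi
```

The engine never tries the second <mi> as a starting point, because a match is already possible from the first one.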

Test

File p.pl:

my $xmlstring = "hello <mi>first mi</mi> and this is another <mi>second mi</mi> end.";
my $compteur = "second";
my $regex = "(?:.*)<mi>(.*?" . $compteur . ".*?)<\/mi>";
my ($data) = $xmlstring =~ /$regex/i;
print "$data\n";

Execution:

$ perl p.pl 
second mi
fedorqui
  • This will match `<mi>aaa</mi> second <mi>bbb</mi>` i.e. it will find the pattern between *any* two opening and closing `mi` tags, not necessarily two that pair with one another. – Borodin Mar 06 '15 at 14:20
  • @Borodin Uhms, I see, hadn't thought about it. Do you see any way to improve it? I am not that familiar with Perl regex. – fedorqui Mar 06 '15 at 14:23
  • @fedorqui: I'm afraid I'm in the camp that says you shouldn't be trying to do this with regexes anyway! You could use something like `(?:.(?!))` to match any character that isn't followed by ``. – Borodin Mar 06 '15 at 14:54