Perl: extract a substring that matches a pattern between two XML tags

Question

I need to parse an XML file without using a module.

In that XML file I need to extract all content between 2 tags (<mi>...</mi>) that matches a pattern.

I have this:

my $xmlstring = '...';    # the XML document, as a single string
my $pattern = "G2_CPU";
my $regex = "<mi>(.*?" . $pattern . ".*?)<\\/mi>";
my ($data) = $xmlstring =~ /$regex/i;

But when I run it, $data contains everything between the very first <mi> tag and the very last </mi> tag.

I also tried the regex without a variable: /(<mi>.*?G2_CPU.*?<\/mi>)/ and I got the same result.

How can I do it?

You have `$compteur` in the regex and you talk about `$pattern`. Are you sure you have the proper variable naming? — fedorqui, Mar 06 '15 at 12:50
Sorry, wrong cut/paste, please replace `$pattern = "G2_CPU"` by `$compteur = "G2_CPU"` — Maximilien, Mar 06 '15 at 12:50
XML is not a regex parsable structure. It's very hazardous to go down this road, because it'll create brittle code that'll mysteriously break one day. — Sobrique, Mar 06 '15 at 13:25
@Maximilien: What are your reasons for avoiding a proper XML parser? You really can't do this reliably using regular expressions. — Borodin, Mar 06 '15 at 14:13
This script will run on a piece of telecom equipment on which I cannot install Perl modules — Maximilien, Mar 06 '15 at 15:28
If you can put your script on said equipment, you can also put at least any non-XS module on said equipment. — Sinan Ünür, Mar 06 '15 at 16:06

score 3 · Answer 1 · edited May 23 '17 at 12:08

Assuming this is still valid XML, i.e. < cannot appear between tag open and tag close, and there is no CDATA within those tags, you can just use:

my $re = qr{<mi>([^<]*? \Q$pattern\E [^<]*?)</mi>}ix;

That is, instead of allowing any character up to the substring of interest, allow just non-tag opening characters.
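A quick way to see what the negated character class buys you, and where it stops working, is a small script like the one below. It is only a sketch: the two sample strings are invented for illustration, and `$plain` and `$nested` are hypothetical names.

```perl
use strict;
use warnings;

my $pattern = 'G2_CPU';
my $re = qr{<mi>([^<]*? \Q$pattern\E [^<]*?)</mi>}ix;

# Hypothetical sample: two <mi> elements that contain plain text only.
my $plain = '<mi>G1_MEM 10</mi><mi>load G2_CPU 42</mi>';
my ($data) = $plain =~ $re;
print "$data\n";    # prints "load G2_CPU 42"

# With markup inside the element, [^<] cannot cross the nested tag,
# so the regex finds nothing.
my $nested = '<mi><b>G2_CPU</b> 42</mi>';
print "nested: ", ($nested =~ $re ? "match" : "no match"), "\n";    # prints "nested: no match"
```

The second case is the limitation stated in the assumption above: as soon as a `<` appears inside the element body, this pattern no longer applies.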

Also, my first instinct, if I ever thought I would try to go down the rabbit hole of parsing XML without a decent XML parser, would have been to first extract the text between <mi>...</mi> and then check if it contains what I am looking for.
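A minimal sketch of that two-step idea, assuming `<mi>` elements carry no attributes and are never nested inside one another. The sample string and the names `@bodies` and `@hits` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; the nested <b> tag is only there to show
# that an element body may contain other markup.
my $xmlstring = '<r><mi>G1_MEM 10</mi><mi><b>G2_CPU</b> 42</mi><mi>g2_cpu 7</mi></r>';
my $pattern   = 'G2_CPU';

# Step 1: collect the body of every <mi>...</mi> element.
# /s lets . match newlines; /g in list context returns every capture.
my @bodies = $xmlstring =~ m{<mi>(.*?)</mi>}gis;

# Step 2: keep only the bodies that contain the literal pattern.
# \Q...\E quotes any regex metacharacters in $pattern.
my @hits = grep { /\Q$pattern\E/i } @bodies;

print "$_\n" for @hits;
# prints:
#   <b>G2_CPU</b> 42
#   g2_cpu 7
```

Because each body is delimited before the pattern is tested, a match can never span two different elements, and other tags inside the body do no harm.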

answered Mar 06 '15 at 13:09

Sinan Ünür

I cannot use your solution, because I have other tags inside `<mi>` and `</mi>` – Maximilien Mar 06 '15 at 14:00

score 1 · Accepted Answer · edited May 23 '17 at 11:52

You just need to add a greedy match at the beginning of the pattern, so that it consumes as much of the string as possible before the <mi> that opens the match:

my $regex = "(?:.*)<mi>(.*?" . $compteur . ".*?)<\/mi>";
             ^^^^^^

From Shortest match issues:

The problem is that even with non-greedy matching, Perl is still trying to find the match that starts at the leftmost possible point in the string.
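The leftmost-match behaviour described in the quote can be seen side by side in a small script. This is only an illustration on a made-up sample string; `$lazy` and `$greedy` are names chosen here.

```perl
use strict;
use warnings;

my $xml = 'hello <mi>first mi</mi> and this is another <mi>second mi</mi> end.';

# Non-greedy .*? still starts at the leftmost <mi> that can lead to a
# match, so the capture runs straight across the first </mi>.
my ($lazy) = $xml =~ m{<mi>(.*?second.*?)</mi>}i;
print "lazy:   $lazy\n";
# prints "lazy:   first mi</mi> and this is another <mi>second mi"

# A leading greedy .* pushes the opening <mi> as far right as possible.
my ($greedy) = $xml =~ m{.*<mi>(.*?second.*?)</mi>}i;
print "greedy: $greedy\n";
# prints "greedy: second mi"
```

Note that `.` does not match a newline unless the `/s` modifier is added, so on multi-line input both patterns behave differently from what is shown here.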

Test

File p.pl:

$xmlstring = "hello <mi>first mi</mi> and this is another <mi>second mi</mi> end." ;
$compteur="second";
my $regex = "(?:.*)<mi>(.*?" . $compteur . ".*?)<\/mi>";
my ($data) = $xmlstring =~ /$regex/i;
print "$data\n";

Execution:

$ perl p.pl 
second mi

answered Mar 06 '15 at 12:59

fedorqui

This will match `<mi>aaa</mi> second <mi>bbb</mi>` i.e. it will find the pattern between *any* two opening and closing `mi` tags, not necessarily two that pair with one another. – Borodin Mar 06 '15 at 14:20

@Borodin Uhms, I see, hadn't thought about it. Do you see any way to improve it? I am not that familiar with Perl regex. – fedorqui Mar 06 '15 at 14:23

@fedorqui: I'm afraid I'm in the camp that says you shouldn't be trying to do this with regexes anyway! You could use something like `(?:.(?!))` to match any character that isn't followed by ``. – Borodin Mar 06 '15 at 14:54
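One way to act on the lookahead idea is the sketch below. It is a variant, not the exact expression from the comment: the negative lookahead is placed before the dot, as `(?:(?!</?mi>).)*`, so each character is consumed only if no `<mi>` or `</mi>` tag starts at that position. It assumes `<mi>` tags have no attributes and are written in lower case; the sample string and the name `$inside` are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: the word "second" appears once between two
# unrelated elements and once inside a real <mi> element.
my $xmlstring = '<mi>aaa</mi> second <mi>bbb</mi> <mi>the <b>second</b> one</mi>';
my $compteur  = 'second';

# Consume one character at a time, but only while the next characters
# do not begin an <mi> or </mi> tag. Other tags such as <b> are allowed.
my $inside = qr{(?:(?!</?mi>).)*}s;

my ($data) = $xmlstring =~ m{<mi>($inside\Q$compteur\E$inside)</mi>}i;
print defined $data ? "$data\n" : "no match\n";
# prints "the <b>second</b> one"
```

Since the capture can never step over an `mi` tag, the opening and closing tags it finds always belong to the same element, and other markup inside the element is still accepted.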
