
I am using the code below to match a string (e.g. <jdgdt\s+mdy=.*?>\s*) that should not be followed by another specific string (<jdg>), but I am unable to get the desired output. Can anyone help me with this?

Input file :

<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>

<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
</tcpar>

Desired Output:

<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>

<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>

Code:

#!/usr/bin/perl

my $filename = $ARGV[0];
my $ext = $ARGV[1];

my $InputFile = "$filename" . "\." . "$ext";

my $document = do {
    local $/ = undef;
    open my $fh, "<", $InputFile or die "Error: Could Not Open File $InputFile: $!";
    <$fh>;
};

$document =~ s/(<jdgdt\s+mdy=.*?>\s*)(?!<jdg>)/$1<jdg>Opinion by Marvel,<e>J.<\/e><\/jdg>\n<taxyr><\/taxyr>\n<disp><\/disp>/isg;

print $document;
Praveen
  • Is this some kind of SGML or XML? – choroba Sep 11 '14 at 15:23
  • `<`, `>`, and `=` do not need to be escaped in perl regex. There's something else wrong, but you should still clean that up. – Brian Stephens Sep 11 '14 at 15:25
  • @choroba : It's an SGML output file. – Praveen Sep 11 '14 at 15:28
  • @BrianStephens : I have edited the code and cleaned up unnecessary escape characters. – Praveen Sep 11 '14 at 15:29
  • Sigh.... http://stackoverflow.com/a/1732454/18157 – Jim Garrison Sep 11 '14 at 15:33
  • So the odd thing about popular answers/quips like that is that people point to them even when they aren't appropriate. No, you can't parse _general_ HTML or SGML with a regex. However, there's no reason why the output of one specific tool shouldn't be manipulable with a regex just because the output happens to also be SGML. – Daniel Martin Sep 11 '14 at 16:02
  • @Daniel Martin, No reason? Here's one reason (and I can think of others): Someone might add ``. You're right that it's not impossible; that's just a shortcut for saying it's more fragile, less readable and harder (at least for XML-based languages) to use regex. – ikegami Sep 11 '14 at 17:16

2 Answers


I had to make two minor adjustments to your regex to get the desired output:

$document =~ s{(<jdgdt\s+mdy\=[^>]*>\s*)(?!\s*<jdg>)}{$1<jdg>Opinion by Marvel,<e>J.</e></jdg>\n<taxyr></taxyr>\n<disp></disp>}isg;

Also, to clean up the code, I switched from using / to using {} to delimit the regex; that way, you don't need to backslash all the slashes that you actually want there in your replacement.
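As a small illustration of the delimiter point, the sketch below runs the same substitution twice, once with / and once with {}. The <x> placeholder and the variable names are made up for this example and are not part of the question's data.

```perl
use strict;
use warnings;

# Hypothetical placeholder input, used only to compare the two spellings.
my $with_slashes = "<x>";
my $with_braces  = "<x>";

# With / as the delimiter, every literal slash must be backslashed.
$with_slashes =~ s/<x>/<e>J.<\/e><\/jdg>/;

# With {} as the delimiter, the slashes can be written as-is.
$with_braces =~ s{<x>}{<e>J.</e></jdg>};

print "same result\n" if $with_slashes eq $with_braces;
```

Both forms do the same substitution; the braces only make the replacement easier to read.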

Explanation of what I changed:

First off, negative lookahead is tricky. What you have to remember is that Perl will try every possible way of making your expression match before it gives up, backtracking into earlier parts of the pattern as needed. Because you had this initially:

/(<jdgdt\s+mdy\=.*?>\s*)(?!<jdg>)/

What would happen is that in that first clause you'd get this match:

<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>
^^^^^^^^^^^^^^^^^^^^^^^^
(this part matched by paren. Note the \n is not matched!)

Perl would consider this a match because after the first parenthesized expression, you have "\n<jdg>". Well, that doesn't match the expression "<jdg>" (because of the initial newline), so yay! found a match.

In other words, Perl would let the \s* at the end of your parenthesized expression match the empty string, so it would find a match and you'd end up inserting the new tags into the first record, where you didn't want them. Another way to put it: because Perl is free to choose how much goes into \s*, it chooses whatever amount lets the expression as a whole match (the empty string for the first docket record, and the newline for the second docket record).
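To see this backtracking for yourself, here is a minimal sketch that runs the original pattern against the first docket record only. The variable names are mine; the record text is taken from the question.

```perl
use strict;
use warnings;

# First docket record from the question: <jdgdt ...> is already followed by <jdg>.
my $record = "<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>\n";

# Original pattern. \s* first grabs the newline, the lookahead then fails,
# so \s* backtracks to the empty string and the lookahead sees "\n<jdg>".
my ($captured) = $record =~ /(<jdgdt\s+mdy=.*?>\s*)(?!<jdg>)/is;

# Brackets make it visible that the capture stops before the newline.
print "captured: [$captured]\n";
```

The capture ends right after the closing > of the <jdgdt> tag, which is why the substitution fires on a record that already has a <jdg> line.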

To get perl to never find a match on the first docket record, I repeated the \s* in the negative lookahead as well. That way, no choice of what to put in \s* could make the expression as a whole match on the initial docket record, and perl had to give up and move to the second docket record.

But then there was a second problem! Remember how I said perl was really aggressive about finding matches anywhere it could? Well, next perl would expand your mdy\=.*?> bit to still find a result in the first docket record. After I added \s* to the negative lookahead, the first docket was still matching (but in a different spot) with:

<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>
^^^^^^^^^^^???????????????????^
(Underlined part matched by paren. ? denotes the bit matched by .*?)

See how perl expanded your .*? way beyond what you had intended? You'd intended that bit to match only stuff up to the first > character, but perl will stretch your non-greedy matches as far as necessary so that the whole pattern matches. This time, it stretched your .*? to cover the > that closed the <jdg> tag so that it could find a spot where the negative lookahead didn't block the match.

To keep perl from stretching your .*? pattern that far, I replaced .*? with [^>]*, which is really what you meant.
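The difference between .*? and [^>]* can be checked with a short sketch like the one below, again on the first docket record only. Both patterns already have \s* in the lookahead; the variable names are mine.

```perl
use strict;
use warnings;

# First docket record from the question.
my $record = "<jdgdt mdy='02/25/2014'>\n<jdg>Opinion by Marvel, <e>J.</e></jdg>\n";

# .*? with /s may cross the first > and the newline, so the match stretches
# up to the > that closes <jdg>, where the lookahead no longer objects.
my ($stretched) = $record =~ /(<jdgdt\s+mdy=.*?>\s*)(?!\s*<jdg>)/is;

# [^>]* cannot cross a >, so there is no way left to match this record.
my ($fixed) = $record =~ /(<jdgdt\s+mdy=[^>]*>\s*)(?!\s*<jdg>)/is;

print defined $stretched ? "with .*?   : [$stretched]\n" : "with .*?   : no match\n";
print defined $fixed     ? "with [^>]* : [$fixed]\n"     : "with [^>]* : no match\n";
```

The first pattern still matches, with a capture that runs into the <jdg> tag, while the second finds nothing in this record.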

After these two changes, we then only found a match in the second docket record, as initially desired.
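Putting it together, here is a self-contained sketch that runs the corrected substitution over the sample input from the question, using a heredoc instead of a file so it can be run as-is. Two details are my own assumptions to reproduce the desired output exactly and are not part of the regex above: the space after "Marvel," and the trailing \n in the replacement.

```perl
use strict;
use warnings;

# Sample input from the question, inlined so no input file is needed.
my $document = <<'END';
<dckt>Docket No. 7677-12.</dckt>
<jdgdt mdy='02/25/2014'>
<jdg>Opinion by Marvel, <e>J.</e></jdg>
<taxyr></taxyr>
<disp></disp>
</tcpar>

<dckt>Docket No. 7237-13.</dckt>
<jdgdt mdy='02/24/2014'>
</tcpar>
END

# Insert the missing lines only after a <jdgdt> tag that is not already
# followed (after optional whitespace) by <jdg>.
# Assumption: the trailing \n keeps </tcpar> on its own line.
$document =~ s{(<jdgdt\s+mdy=[^>]*>\s*)(?!\s*<jdg>)}{$1<jdg>Opinion by Marvel, <e>J.</e></jdg>\n<taxyr></taxyr>\n<disp></disp>\n}isg;

print $document;
```

Only the second docket record is changed; the first one is left alone because its <jdgdt> tag is already followed by <jdg>.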

Daniel Martin
  • Thanks, it's working fine. Can you explain why \s* needs to be placed in the lookahead, as in (?!\s*<jdg>), when we have already put \s* in the group itself, as in (<jdgdt\s+mdy\=[^>]*>\s*)? – Praveen Sep 11 '14 at 15:41
  • Thanks for the explanation. I get your point and it is clear to me now. – Praveen Sep 11 '14 at 15:47
  • Fixed up the explanation, including both changes I had to make, and fixed my example match stuff. – Daniel Martin Sep 11 '14 at 16:04

Use positive lookahead. (?!<jdg>) or something similar, look it up.

GitRDun
  • @GiRDun : I have done that, but I am unable to get the desired output, which I described clearly in the question. – Praveen Sep 11 '14 at 15:33
  • How was this at all useful? And that part of the regex you pasted as-is from his original post is a negative lookahead, not positive. – Brian Stephens Sep 11 '14 at 16:12