I am trying to add a stylesheet declaration to the second line of any XML file my script processes. My script reads the file line by line into the $inputline string within a loop.

I have the following poorly-written Perl code:

while(<INPUT>) {

$inputline = $_;

if ($inputline =~ m/\<\?xml\ version\=\"1\.0\"\ encoding\=\"UTF-8\"\?\>/){

print OUTPUT "\<\?xml version\=\"1.0\" encoding\=\"UTF-8\"\?\>\n";
print OUTPUT "\<\?xml\-stylesheet type\=\"text\/xsl\" href\=\"askaway_transcript_stylesheet\.xsl\"\?\>\n";
}

#lots of other processing stuff
}

And I think this worked once, but it no longer does. Testing different output and tweaking things tells me that the IF statement is failing, and I've probably done something wrong there.

Any tips?

Brandon
  • *Any* xml file? If so, this will only match one *specific* xml header. As for matching on XML, I would refer you to: http://stackoverflow.com/a/1732454/179216 – Jeff B Aug 08 '13 at 18:42
  • If there is any deviation from this pattern in your input, e.g., different punctuation or stray blanks in the middle, your regex won't match. Are you sure no variations are occurring in your input? – Doctor Dan Aug 08 '13 at 18:47
  • You've got a `\` in front of the letter `U`. Perl treats any escaped punctuation as the literal punctuation character and any escaped letter as a special regex command. `\U` uppercases the following characters in the string. (Actually, `\U` isn't a regex escape sequence; it's a double-quoted-string escape sequence.) – Adrian Pronk Aug 08 '13 at 19:08
  • For now, the headers should all be the same. I know it's pretty rigid, but I can fix that later. – Brandon Aug 08 '13 at 20:35
  • Adrian - thanks for pointing out the escaped U... Stupid error. – Brandon Aug 08 '13 at 20:35

1 Answer

Your regex for finding the XML header is very rigid. What if there are extra spaces? What if the encoding or the XML version is different? Regex is not the right tool for parsing XML/HTML (see this answer). However, it is understandable why you would want to use one here, given the limited scope of what you are trying to do.

That being said, if you are going for simplicity and are willing to accept some possible failures, I would opt for a simpler regex and only do the replacement once:

my $replaced = 0;    # declare this once, before the read loop

# inside the loop:
if (!$replaced && $inputline =~ m/<\?xml\b.*>/) {

    print OUTPUT $inputline;
    print OUTPUT '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>'."\n";

    $replaced = 1;
}

Alternatively, you could exit your parse loop at that point, assuming that is all you are doing in the loop.

Caveat:

  • If your XML is all written on one line, or even if there is another tag on the same line (which is legal), this will most likely break your XML.
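One way to soften this caveat, offered as a rough sketch rather than a robust XML solution: insert the stylesheet instruction directly after the declaration's closing `?>` with a substitution, so anything else on the same line ends up after it. The helper name `add_stylesheet` and its arguments are made up for illustration, and the sketch assumes the whole declaration sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the line with a stylesheet instruction
# inserted right after the XML declaration, or the line unchanged if no
# declaration is found. Assumes the declaration does not span lines.
sub add_stylesheet {
    my ($line, $href) = @_;
    my $pi = qq{<?xml-stylesheet type="text/xsl" href="$href"?>};

    # \s after "xml" keeps this from matching <?xml-stylesheet ...?>.
    # .*? stops at the first "?>", i.e. the end of the declaration.
    # Trailing blanks and the line's own newline are consumed so that
    # no empty line is left behind.
    $line =~ s/(<\?xml\s.*?\?>)[^\S\n]*\n?/$1\n$pi\n/;
    return $line;
}

# Declaration and root element on the same line.
my $one_line = qq{<?xml version="1.0" encoding="UTF-8"?><root><item/></root>\n};
print add_stylesheet($one_line, 'askaway_transcript_stylesheet.xsl');
```

Since the function returns lines without a declaration untouched, it could be called on every line in the read loop, still combined with a "done once" flag if the input might contain more than one match.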

Edit:

Your entire while loop would probably look like this:

my $replaced = 0;    # declared outside the loop so the flag survives between lines
while ($inputline = <MYXML>) {
    if (!$replaced && $inputline =~ m/<\?xml\b.*>/) {

        print OUTPUT $inputline;
        print OUTPUT '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>'."\n";

        $replaced = 1;
    } else {
        print OUTPUT $inputline;
    }
}

Or:

my $replaced = 0;    # declared outside the loop so the flag survives between lines
while ($inputline = <MYXML>) {

    print OUTPUT $inputline;

    if (!$replaced && $inputline =~ m/<\?xml\b.*>/) {
        print OUTPUT '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>'."\n";

        $replaced = 1;
    }
}
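The snippets above assume that `MYXML` and `OUTPUT` are already open. For trying the idea in isolation, here is a self-contained sketch that wraps the second variant in a sub and uses in-memory filehandles in place of real files. The sub name `copy_with_stylesheet` and the sample input are invented for the example, and the flag is kept outside the `while` loop so the stylesheet line is added at most once.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop: copies $in to $out and adds the
# stylesheet line after the first line that looks like an XML declaration.
# Returns 1 if a stylesheet line was added, 0 otherwise.
sub copy_with_stylesheet {
    my ($in, $out) = @_;
    my $replaced = 0;    # lives outside the loop, so it is set at most once

    while (my $inputline = <$in>) {
        print {$out} $inputline;

        if (!$replaced && $inputline =~ m/<\?xml\b.*>/) {
            print {$out} '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>' . "\n";
            $replaced = 1;
        }
    }
    return $replaced;
}

# In-memory filehandles stand in for real files here.
my $source = qq{<?xml version="1.0" encoding="UTF-8"?>\n<transcript>\n</transcript>\n};
my $result = '';
open(my $in,  '<', \$source) or die "Cannot open input: $!";
open(my $out, '>', \$result) or die "Cannot open output: $!";
copy_with_stylesheet($in, $out);
close($out);
print $result;
```

With real files, the two `open` calls would take file names instead of scalar references; the loop itself stays the same.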
Jeff B
  • Thanks! This approach seems to work, but for some reason it's still printing the original declaration below the two strings printed in the IF statement. Any ideas out of that? – Brandon Aug 08 '13 at 20:39
  • If you notice, I `print OUTPUT $inputline;`, instead of printing it explicitly. Did you remove the `print OUTPUT "\<\?xml v....` line? – Jeff B Aug 08 '13 at 20:45
  • Or do you `print OUTPUT $inputline;` outside of the `if` statement? If so, you need to put it in an `else` block, or rearrange your code. See edit above. – Jeff B Aug 08 '13 at 20:52
  • I've edited my question to add some more detail. This all takes place inside a loop in which I do lots of different processing to the text. (The other processing works fine.) There is no "print OUTPUT $inputline;" outside of the IF statement. I've tried those modified bits of code as suggested, but I still end up printing the original declaration after the replacement. – Brandon Aug 08 '13 at 21:02
  • So you aren't outputting the rest of the file anywhere? Are you doing a `print OUTPUT;`? – Jeff B Aug 08 '13 at 21:06
  • Ah, another dumb mistake. Yes, I'm doing a print OUTPUT; far later. I removed the else {}, and it works fine. Tremendous thanks, Jeff B. – Brandon Aug 08 '13 at 22:55
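
For reference, the duplicate line in the comments above came from a bare `print OUTPUT;` later in the loop: with no argument list, `print` writes `$_`, which still holds the original input line. The following is a sketch of one alternative to removing the `else` block, under the assumption that the general print should stay at the bottom of the loop: call `next` once the header has been handled. The in-memory filehandles and sample input are illustrative only.

```perl
use strict;
use warnings;

my $source = qq{<?xml version="1.0" encoding="UTF-8"?>\n<transcript/>\n};
my $output = '';
open(my $in,  '<', \$source) or die "Cannot open input: $!";
open(my $out, '>', \$output) or die "Cannot open output: $!";

my $replaced = 0;
while (<$in>) {
    if (!$replaced && /<\?xml\b.*>/) {
        print {$out} $_;
        print {$out} qq{<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>\n};
        $replaced = 1;
        next;    # skip the general print at the bottom of the loop
    }

    # ... other per-line processing would go here ...

    print {$out} $_;    # same effect as a bare "print OUTPUT;", which prints $_
}
close($out);
print $output;
```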