0

I'm trying to write a simple Perl script that looks for obvious errors in an XML file. I figured the best way to do this is to write a regular expression and have Perl use it to print the bad lines of XML. Here is my code:

#!/usr/bin/perl
$file = '/path/to/my/xml/file.txt';
open(txt, $file);
while($line = <txt>) {
  print "$line" if $line =~ m/<[a-zA-Z]*>[$a-zA-Z0-9]*>[a-zA-Z0-9]*</;
}
 close(txt);

The regex I'm using works perfectly in Notepad++, but it doesn't work when I put it in Perl. I'm trying to find a line of XML that looks like this:

<tag>badline></tag>

If I break my regex apart, each piece returns lines on its own:

m/<[a-zA-Z]*> -works
[$a-zA-Z0-9]*> -works
[a-zA-Z0-9]*</; -works

But when I combine them as shown in the code, nothing is returned.

Any help is greatly appreciated, thanks.

whit3y
  • This is a really bad way to process XML. Why not use one of CPAN's many XML parsers to check for errors? – friedo Oct 10 '13 at 16:30
  • Welcome to SO. This question seems to come up about once a day, so searching before you posted would have been appropriate. The simple answer is that regex is exactly the WRONG tool to use for processing XML. See [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454). Use a real XML parser, which you can find on CPAN. – Jim Garrison Oct 10 '13 at 16:31
  • you realize that `badline>` is actually well-formed XML do you? – mirod Oct 10 '13 at 17:06
  • @mirod: ...which, actually, is a good reason _not_ to use a real XML parser if you want to detect typos like that. Usually, you do want to parse the XML properly first, but in this case it's kind of hard to tell the difference between `badline>` (probably a typo) and `badline&gt;` (probably intentional) after the markup has been parsed. – Ilmari Karonen Dec 16 '13 at 22:08
  • much later... @IlmariKaronen: I wouldn't call it a typo; the 2 forms are pretty much identical as far as XML parsers are concerned. They both parse and they are both interpreted the same way, i.e. the parser will return the text `badline>` in both cases. – mirod Sep 19 '14 at 08:01
  • @mirod: Sure, but depending on context, there's a pretty good chance that whoever wrote it *meant* to write `<badline>`. (OK, except that that's not valid XML; it should be `<badline/>`, but still...) – Ilmari Karonen Sep 19 '14 at 10:13
  • @IlmariKaronen OK, it makes sense. A parser would catch it much later, when it would find the closing tag. But if the XML is keyed in, you may have bigger problems than the occasional missing bracket. What about the rest of the data? – mirod Sep 19 '14 at 13:34

2 Answers

1

You must always `use strict` and `use warnings` at the top of every Perl program, no matter how trivial, and declare all your variables using `my` at their first point of use. The warnings would have told you that Perl was trying to interpolate the variable `$a` within the regular expression; it is undefined and so evaluates to an empty string. That leaves the character class as `[-zA-Z0-9]`, which matches no lowercase letter except `z`.

I don't know why you want to match dollar characters in your character class, but in a Perl regex you need to escape the dollar sign, like `[\$a-zA-Z0-9]`.
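To make the difference concrete, here is a minimal sketch that runs the escaped and unescaped versions of the pattern against two sample lines. The sample lines and the helper `bad_lines` are made up for illustration; the warning is silenced only so the demo prints cleanly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; only the second has the stray '>' from the question.
my @lines = (
    '<tag>goodline</tag>',
    '<tag>badline></tag>',
);

# Escaped dollar sign: the class holds a literal '$' plus a-z, A-Z and 0-9.
my $fixed = qr/<[a-zA-Z]*>[\$a-zA-Z0-9]*>[a-zA-Z0-9]*</;

# Unescaped: Perl interpolates $a (undefined here), leaving [-zA-Z0-9],
# which cannot match the lowercase letters in "badline".
my $broken = do {
    no warnings 'uninitialized';    # silence the warning for this demo only
    qr/<[a-zA-Z]*>[$a-zA-Z0-9]*>[a-zA-Z0-9]*</;
};

# Hypothetical helper: return the lines that match the given pattern.
sub bad_lines {
    my ($re, @in) = @_;
    return grep { $_ =~ $re } @in;
}

print "fixed:  $_\n" for bad_lines($fixed,  @lines);
print "broken: $_\n" for bad_lines($broken, @lines);
```

I would expect only a `fixed:` line for `<tag>badline></tag>` to be printed, since the unescaped pattern matches neither sample. Without the `no warnings` line, Perl should also report the uninitialized `$a` during regexp compilation, which is the hint that points at the real problem.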

Overall, though, unless you have a specific formatting problem, I think it would be better to just put the XML through an XML parser or editor. That way any errors will be pointed out immediately, without you having to check for specific problems.
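As a rough sketch of the parser route, the following assumes the CPAN module XML::LibXML is installed (it is not part of core Perl). The helper `xml_error` and the sample documents are hypothetical. Note that a stray `>` in text is still well-formed XML, so a parser reports structural errors such as an unclosed tag but not that particular typo.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;    # assumption: installed from CPAN, not core Perl

# Hypothetical helper: return nothing if $xml is well-formed,
# otherwise the parser's error message as a string.
sub xml_error {
    my ($xml) = @_;
    return if eval { XML::LibXML->load_xml(string => $xml); 1 };
    (my $msg = "$@") =~ s/\s+\z//;    # stringify the error, trim trailing whitespace
    return $msg;
}

# Hypothetical sample documents.
my %samples = (
    'stray bracket' => '<tag>badline></tag>',    # well-formed: '>' is legal in text
    'unclosed tag'  => '<tag><inner></tag>',     # not well-formed
);

for my $name (sort keys %samples) {
    my $err = xml_error($samples{$name});
    print "$name: ", (defined $err ? "error: $err" : "well-formed"), "\n";
}
```

To check a file rather than a string, `load_xml` also accepts `location => $path`. If the goal is specifically to catch typos like the stray bracket, a targeted regex check can run alongside the parser instead of replacing it.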

Borodin
  • Holy cow, that did it. Thank you so much. I would have used a more traditional parser, but it's for work and they wanted a script for a specific environment so anyone can run the script to verify what developers are giving us. So I figured writing a simple Perl script was the way to go, since regular expressions are built into it. If you have a recommendation on how I should go about this, I'm all ears. Thanks again. – whit3y Oct 10 '13 at 17:24
  • If your XML has a DTD or a schema describing it then you need to use that. – Borodin Oct 10 '13 at 17:29
-1

I think it is better to use uppercase names for filehandles, and to remember to close the filehandle after use.

#!/usr/bin/perl -w
# The -w switch above turns on warnings; try to always use it.
use strict;  # try to always enable strict.
open(TXT, "/path/to/my/xml/file.txt") or die "Cannot open the file: $!";
while (<TXT>)
{
    # I am not sure whether you have other formats, but this one works well with the format you provided.
    if (/<.*>(.*)?<.*>/)
    {
        print $_;
    }
}
close TXT;