How to efficiently search/replace certain strings in a file in perl?

Question

My file looks like this:

<MAIN>  
  <SUB_MAIN>one</SUB_MAIN>  
  <VER>version#</VER>  
  (OTHER STUFF...)  
  <LOCATION>PATH</LOCATION>  
</MAIN>

<MAIN>  
  <SUB_MAIN>two</SUB_MAIN>  
  <VER>version#</VER>  
  (OTHER STUFF...)  
  <LOC>PATH</LOC>  
</MAIN>

What I want to do is search for the value of SUB_MAIN, let's say one, and if I find it, look up the value of LOCATION. Then I go to that location, do some syncing, get a new version from there, and update the VER information.

My current code has like three loops and is ugly. The skeleton is like this:

$value = "one|two|three";

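# for each line in file ($lineNum is the index into @FileDat)
while ($lineNum < @FileDat) {

    # see if it is a sub module?
    # (?:...) keeps the one|two|three alternation inside the tags
    if ( $FileDat[$lineNum] =~ /<SUB_MAIN>(?:$value)<\/SUB_MAIN>/ ) 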
    {   
       $found_it = 0;

        while (!$found_it) 
        {       
            $lineNum++;     
            if ( $FileDat[$lineNum] =~ /\<VER\>\d+\<\/VER\>/ ) 
            {
                $currIndex = $lineNum;

                while(1)
                {
                   $lineNum++;
                   if ( $FileDat[$lineNum] =~ /\<LOC\>(.+)\<\/LOC\>/ ) 
                    {   #DO SOME STUFF...
                        $found_it = 1;
                        last;
                    }
                }               
                        #replace version #
                $FileDat[$currIndex] = "    <VER>$latestChangeList</VER>\n";
            }
        }
    }
    $lineNum++;
}

# write the modified array to new file
print NEWCFGFILEPTR @FileDat;

close(OPEN_FILES);

How can I make it better?
Thank you.

and the answers are more on how to do this properly so it doesn't break randomly when the format of the pseudo-XML changes. Even the problem as stated is easier to solve with XML tools than with nested loops and regexps. — mirod, Nov 12 '11 at 10:18
No no no !!!! Do **not** use regular expressions to parse HTML/XML unless you have a *very* good reason to! Please see [this SO question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for a highly entertaining answer. — Matt Fenwick, Nov 11 '11 at 17:24

score 1 · Accepted Answer · edited Nov 14 '11 at 11:43

Use XML::Simple. There is no need to reinvent the wheel unless you plan to build a better one, and I highly doubt that is your task.

answered Nov 11 '11 at 17:32 by FailedDev · edited by daxim

correct but i do want to find out if i can reduce the number of loops i have :) – infinitloop Nov 11 '11 at 18:08
No need for loops or whatever. You can customize the xml simple in such a way that it can group elements together or put them into hashes etc. So you can adjust it to your needs. – FailedDev Nov 11 '11 at 18:10
@the input file is not XML though, so it requires a little marshalling before XML::Simple can be used on it. – mirod Nov 12 '11 at 10:14

score 1 · Answer 2 · answered Nov 12 '11 at 10:13

Actually, using an XML parser is a bit more complex than just using an XML module, since what you have is NOT well-formed XML. A well-formed XML file would have a single root, so all the MAIN elements would be wrapped in a single element.

There is a relatively simple way to fake it though: reference your file as an external XML entity, and wrap that entity in a single top-level element.

Also, in your example data, you have a LOCATION element in the first MAIN, then a LOC element in the second MAIN. I assume it's a cut'n paste error.

Here is a way to do this with XML::Twig that works with an input file of any size (including one too big to fit in memory) and writes to standard output.

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

binmode( STDOUT, ':utf8'); # if your input file is in UTF-8

my $file= shift @ARGV;
# wrap the content of the file in <data>...</data> so it becomes well-formed XML
my $xml= qq{<?xml version="1.0"?>
            <!DOCTYPE data [ <!ENTITY file SYSTEM "$file">]>
            <data>&file;</data>
           };

XML::Twig->new( twig_handlers => { MAIN => \&main },
                keep_spaces => 1,
              )
         ->parse( $xml);

exit;

sub main
  { my( $t, $main)= @_;
    my $location= $main->field( 'LOCATION');
    $main->set_field( VER => get_version( $location));
    $main->print;
    $main->purge; # if the file is big and you want to free the memory
  }

sub get_version
  { my( $location)= @_;
    return "new.version.$location"; # the real code might be different!
  }

If your input file is NOT in UTF-8, you may need to change the wrapper to add the proper encoding to the XML declaration. If it is pure ASCII, you're good (and should UTF-8 characters be added later, it will still work).

If you don't want to use XML::Twig, the same technique applies to create proper XML that can be read by XML::Simple or whatever other module you want to use.
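As a rough sketch of that idea with XML::Simple, the version below slurps the file and wraps its content in a `<data>` root in memory rather than through an entity. That is a swapped-in way of building the wrapper, and it assumes the file fits in memory. `update_versions` is a hypothetical helper name, and `get_version` is the same placeholder as in the XML::Twig code. XML::Simple stores each MAIN as a hash, so element order and stray text between elements may not survive the round trip. Treat this as an illustration, not a drop-in replacement.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Simple qw(XMLin XMLout);

# Placeholder for the real syncing step, as in the XML::Twig version.
sub get_version {
    my ($location) = @_;
    return "new.version.$location";
}

# Hypothetical helper: wrap the fragment in a single root element so it
# parses as well-formed XML, then update VER in every MAIN with a LOCATION.
sub update_versions {
    my ($content) = @_;
    my $data = XMLin(
        "<data>$content</data>",
        ForceArray => ['MAIN'],    # always an array ref, even for one MAIN
        KeyAttr    => [],          # keep the MAIN list as a list
    );
    for my $main ( @{ $data->{MAIN} || [] } ) {
        next unless defined $main->{LOCATION};
        $main->{VER} = get_version( $main->{LOCATION} );
    }
    return $data;
}

if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $content = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    # NoAttr writes hash values as child elements instead of attributes;
    # an empty RootName leaves out the artificial <data> wrapper.
    print XMLout( update_versions($content), NoAttr => 1, RootName => '' );
}
```

Run it with the config file as the only argument and redirect standard output to the new file. A MAIN without a LOCATION element is left untouched.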

score 0 · Answer 3 · edited Nov 14 '11 at 11:42

You have an XML file. Rather than parsing that with regular expressions (which is generally considered to be a Bad Idea), try using one of the existing XML parsing modules, like XML::Parser. There are many other modules like it, which you can find by searching for xml on search.cpan.org, but that's a good one.

answered Nov 11 '11 at 17:23 by Dan · edited by daxim
