I have a file which has multiple quotes like the one below:

  <verse-no>quote</verse-no>
            <quote-verse>1:26,27 Man Created to Continually Develop</quote-verse>
            <quote>When Adam came from the Creator’s hand, he bore, in his physical, mental, and
                spiritual nature, a likeness to his Maker. “God created man in His own image”
                (Genesis 1:27), and it was His purpose that the longer man lived the more fully
                he should reveal this image—the more fully reflect the glory of the Creator. All
                his faculties were capable of development; their capacity and vigor were
                continually to increase. Ed 15
            </quote>

I want to remove all strings from <quote-verse>.....</quote-verse> line so that the end result will be <quote>1:26,27</quote>.

I have tried perl -pi.bak -e 's#\D*$<\/quote-verse>#<\/quote-verse>#g' file.txt

This does nothing. I am a beginner in perl (self taught) with less than 10 days experience. Please tell me what's wrong and how to proceed.

One Face
  • `s/()([\d,:]+)[\D+](<\/quote-verse>)/$1$2$3/;` – DavidO Feb 07 '15 at 17:04
  • @DavidO I executed `perl -pi.bak -e 's/()([\d,:]+)[\D+](<\/quote-verse>)/$1$2$3/g' file.txt` and the resultant file showed no change – One Face Feb 07 '15 at 17:16
  • `[\D+]` should be `\D+` probably. – TLP Feb 07 '15 at 17:20
  • @TLP I tried without the square brackets first (I wrote the code on my own after seeing DavidO's comment so forgot to add the brackets) but it did not change anything – One Face Feb 07 '15 at 17:23
  • @CRags Concerning your problem: There is a long-standing meme on this site that you should not parse XML with regexes. You should use a parser instead. The answer most often referred to is this: http://stackoverflow.com/a/1732454/725418 With that said, it is possible to parse a limited set of XML with regexes, just so long as you know that you could be messing your format up. Is it your intent to change the tag (to ``) as well, or is that a typo? – TLP Feb 07 '15 at 18:24
  • There's a lot of hazards to it. I would suggest `XML::Twig` and will happily offer a workable example with a bit more XML to start off with. – Sobrique Feb 07 '15 at 18:42
  • @TLP You're correct, the `[]` brackets weren't serving any useful function. Originally I had `[^\d,:]+`, but then pared it back and should have removed the last remnants. :) – DavidO Feb 07 '15 at 20:35
  • @TLP no the tag is correct. I just want the strings to be removed leaving only digits behind. – One Face Feb 07 '15 at 20:57

1 Answer

You have XML, so you want an XML parser. XML::Twig is a good one. The reason a lot of people say 'don't use regular expressions to parse XML' is that, whilst it does work in a limited scope, XML is a specification: certain things are valid and some are not. If you write code built on assumptions that aren't always true, you end up with brittle code: code that will break one day without warning if someone alters their perfectly valid XML into slightly different but still perfectly valid XML.
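To make the brittleness concrete, here is a minimal regex-only sketch. The helper name `trim_with_regex` and the `lang="en"` attribute are made up for illustration, and the pattern assumes the whole element sits on one line with no attributes, which is exactly the kind of assumption that valid XML is free to break.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: trims a <quote-verse> element that sits on one line.
# Assumes no attributes and no nested markup inside the element.
sub trim_with_regex {
    my ($line) = @_;
    # Keep the opening tag plus the leading digits, colons and commas,
    # drop everything up to the closing tag.
    $line =~ s{(<quote-verse>[\d:,]+)[^<]*(</quote-verse>)}{$1$2};
    return $line;
}

# Layout from the question: the substitution applies.
print trim_with_regex('<quote-verse>1:26,27 Man Created</quote-verse>'), "\n";

# Same element with an (invented) attribute: still valid XML, but the
# pattern no longer matches, so the line comes back unchanged.
print trim_with_regex('<quote-verse lang="en">1:26,27 Man Created</quote-verse>'), "\n";
```

The second call fails silently, with no error and no change, which is why the parser-based approach is the safer default.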

So with that in mind:

This works:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

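# Called once for each <quote-verse> element the parser finishes reading.
sub quote_verse_handler {
    my ( $twig, $quote ) = @_;
    my $text = $quote->text;
    # Keep everything up to the last digit; drop the non-digit text after it.
    $text =~ s/(\d)\D+$/$1/;
    $quote->set_text($text);
}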

my $parser = XML::Twig->new(
    twig_handlers => { 'quote-verse' => \&quote_verse_handler },
    pretty_print  => 'indented'
);


#$parser -> parsefile ( 'your_file.xml' );
local $/;    # undefine the record separator so <DATA> is read in one go
$parser->parse(<DATA>);
$parser->print;


__DATA__
<xml>
<verse-no>quote</verse-no>
        <quote-verse>1:26,27 Man Created to Continually Develop</quote-verse>
        <quote>When Adam came from the Creator's hand, he bore, in his physical, mental, and
            spiritual nature, a likeness to his Maker. "God created man in His own image"
            (Genesis 1:27), and it was His purpose that the longer man lived the more fully
            he should reveal this image-the more fully reflect the glory of the Creator. All
            his faculties were capable of development; their capacity and vigor were
            continually to increase. Ed 15
        </quote>
   </xml>

What this does is run through your file. Each time it encounters a quote-verse element, it calls the handler and gives it 'that bit' of the XML to work with. We apply a regular expression to chop off the trailing part of the text, and then update the XML accordingly.

Once parse is finished, we spit out the finished product.
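One thing to be aware of with the handler's regular expression: it trims after the last digit in the text, so a title that itself contains a digit would leave residue. The question doesn't show such a title, so the second sample string below is invented. A sketch of an alternative that keeps only the leading reference (the helper names are hypothetical, not part of XML::Twig):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution used in the handler above, wrapped for comparison.
sub trim_after_last_digit {
    my ($text) = @_;
    $text =~ s/(\d)\D+$/$1/;
    return $text;
}

# Hypothetical alternative: keep only the leading run of digits, colons
# and commas; return the text untouched if it doesn't start with one.
sub leading_reference {
    my ($text) = @_;
    return $text =~ /^\s*([\d:,]+)/ ? $1 : $text;
}

for my $sample ( '1:26,27 Man Created to Continually Develop',
                 '1:26,27 The 2 Trees' ) {    # second title is invented
    printf "%-45s | %-15s | %s\n",
        $sample, trim_after_last_digit($sample), leading_reference($sample);
}
```

For the text in the question both give `1:26,27`; for the invented title the first gives `1:26,27 The 2` and the second gives `1:26,27`. Inside the handler you would use it as `$quote->set_text( leading_reference( $quote->text ) );`.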

You'll probably want to replace:

local $/;
$parser -> parse ( <DATA> );

with:

$parser -> parsefile ( 'your_file_name' );

You may also find:

$parser -> print_to_file( 'output_filename' ); 

to be useful.
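Putting those two replacements together, a file-to-file version might look like the sketch below. The wrapper name `trim_quote_verses` and the script name in the usage comment are placeholders; `parsefile` and `print_to_file` are the XML::Twig methods mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical wrapper: read $in, trim every <quote-verse>, write $out.
sub trim_quote_verses {
    my ( $in, $out ) = @_;
    my $parser = XML::Twig->new(
        twig_handlers => {
            'quote-verse' => sub {
                my ( $twig, $elt ) = @_;
                my $text = $elt->text;
                $text =~ s/(\d)\D+$/$1/;    # same trim as the handler above
                $elt->set_text($text);
            },
        },
        pretty_print => 'indented',
    );
    $parser->parsefile($in);          # dies if the input is not well-formed XML
    $parser->print_to_file($out);
    return;
}

# File names come from the command line, for example:
#   perl trim_verses.pl input.xml output.xml
trim_quote_verses(@ARGV) if @ARGV == 2;
```

Writing to a separate output file keeps the original intact, much as `-i.bak` does for the one-liner in the question.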

Sobrique