I am working on a regex to replace a pattern in a string.

The string in my stream is very long. After applying the regex (find the pattern, then replace it with a constant value), I have to carry the string forward into my ETL stream.

To find:
<customer attribute="any number">
for example <customer attribute="1">
and replace with:
<customer> (basically just keep "customer" and delete everything else)

I am new to regex and still learning it.

Any help would be appreciated!

Vikas Kumar

2 Answers

Please, please, please. Don't use regular expressions to parse XML.

It's bad news. It's brittle and hacky, and most importantly of all - completely unnecessary.

Regular expressions do not handle context. And XML is all about context.
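
To make the brittleness concrete, here is a minimal sketch using only core Perl. The three strings are spellings of the same opening tag that XML allows (single quotes, and whitespace or line breaks inside the tag). The `$naive` pattern is an assumption made up for illustration, not something from the question; it only recognises one of the three layouts.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Three spellings of the same element; an XML parser treats them identically.
my @equivalent = (
    '<consumer attribute="1">',
    "<consumer attribute='1'>",           # single quotes are legal in XML
    "<consumer\n  attribute = \"1\">",    # so are line breaks and spaces around '='
);

# Hypothetical naive pattern that expects exactly one layout.
my $naive = qr/<consumer attribute="\d+">/;

for my $tag (@equivalent) {
    ( my $shown = $tag ) =~ s/\n/\\n/g;    # make the line break visible when printing
    printf "%-40s %s\n", $shown, ( $tag =~ $naive ? 'matched' : 'MISSED' );
}
```

Only the first spelling matches. You can keep widening the pattern, but every widening is another case a parser already handles for free.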

XML already has a query language, XPath, which is far better suited.

Here's an example of finding a node using xpath.

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> new -> parsefile ('yourfile.xml'); 

print $twig -> get_xpath('//consumer', 0) -> att('attribute'),"\n";

But if you want to transform it and delete the attribute:

$_ -> del_att('attribute') for $twig -> get_xpath('//consumer[@attribute]');
$twig -> set_pretty_print('indented_a');
$twig -> print;

I would ask though - why are you trying to do that? It sounds rather like another broken process somewhere - maybe another script trying to regex the XML?

But the other thing that XML::Twig does really well is that it has twig_handlers, which let you handle XML streams more neatly (e.g. without needing to parse it all into memory).

That goes a bit like this:

#!/usr/bin/env perl

use strict;
use warnings;
use XML::Twig;

sub delete_unwanted {
    my ( $twig, $element ) = @_; 
    $element -> del_att('attribute'); 
    #dump progress so far 'out'. 
    $twig -> flush; 
    #free memory already processed. 
    $twig -> purge; 
}

my $twig = XML::Twig -> new ( twig_handlers => { '//consumer[@attribute]' => \&delete_unwanted } );
$twig -> parsefile ( 'your_xml.xml');

We set a handler so that each time the parser encounters a consumer element with an attribute named attribute (bad name, that), it deletes the attribute, flushes (prints) the XML parsed so far, and purges it from memory. This makes it very memory efficient, as you're not reading the whole thing into memory, and you can still do the kind of inline operations you would otherwise reach for a regex to do.

Sobrique
  • The OP is talking about a stream. This would be a great place to advertise XML::Twig's ability to work with chunked data. – simbabque Nov 07 '16 at 10:56
  • @Sobrique: First of all, thanks a lot for your time. You elaborated the point so well. But in my case I am using Pentaho Kettle to make my first draft of the XML. Then a lot of manipulation is done, and for traversal purposes I added "attribute". Finally I need to remove it. – Vikas Kumar Nov 07 '16 at 13:41
  • Well, that sounds like you may be tackling this problem in a less than ideal way. Most things you pass XML to will simply ignore the attribute if it is redundant. That's part of the point of XML. – Sobrique Nov 07 '16 at 13:52
Input:

<consumer attribute="1"><birth-date>1990-07-23</birth-date> </consumer>

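use strict;
use warnings;

my $element_name = "consumer";
my $str = "<consumer attribute=\"1\"><birth-date>1990-07-23</birth-date> </consumer>";

# Replace the whole opening tag with a bare <consumer>; note that this drops every attribute on the tag.
$str =~ s/<(\Q$element_name\E)[^>]*attribute="[^"]*"[^>]*>/<$1>/g;

print $str;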

Output:

<consumer><birth-date>1990-07-23</birth-date> </consumer>
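
The question asks specifically for a numeric value ("any number"), so a slightly stricter variant may be closer to what is wanted. The following is only a sketch under stated assumptions: `strip_numeric_attribute` is a hypothetical helper name, the tag is assumed to carry no attribute other than `attribute`, and the caveats about regexes and XML raised in the other answer still apply.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: drop attribute="<digits>" from the opening tag of the
# given element. Assumes "attribute" is the only attribute on that tag.
sub strip_numeric_attribute {
    my ( $str, $element ) = @_;
    $str =~ s{
        <(\Q$element\E)          # opening tag name, first capture group
        \s+ attribute \s* = \s*  # attribute name, optional spaces around the equals sign
        (["'])\d+\2              # a quoted run of digits, same quote on both sides
        \s* >
    }{<$1>}gx;
    return $str;
}

print strip_numeric_attribute(
    '<consumer attribute="1"><birth-date>1990-07-23</birth-date> </consumer>',
    'consumer'
), "\n";
```

Tags whose value is not purely numeric, such as attribute="abc", are left untouched.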

ssr1012
  • @DaveCross: Can you point to any post where I did that? I am sure I didn't encourage using regex for XML conversions. I just answered the regex question that was asked. Enough. Thanks for your downvote. – ssr1012 Nov 07 '16 at 11:08