
I am writing a Perl script that reformats an XML file. I need help ignoring whitespace before the opening of any XML tag. I have the following XML file:

test.xml

   <xml>
      <TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>
   </xml>  

I want a regular expression that replaces any whitespace, including extra spaces and newline characters, before the opening of any XML tag with a single space. In the example above, <VARPARA> is the tag preceded by whitespace and newline characters after "where".

I was thinking something along the lines of

$s =~ s/\s*</ </ig; 

but this looks only at the opening <, whereas I want to match the whole tag, both the opening < and the closing >, such as

    <VARPARA>

The output string should look like below

    <xml>
      <TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
   </xml>  
Randall
atif
  • Why doesn't it remove the spaces before ``? – Avinash Raj Aug 27 '14 at 16:53
  • How come the spaces and the new lines are still present before `` in the desired output? – ikegami Aug 27 '14 at 16:54
  • It should only apply when text is found before the XML tag. In my case, whitespace between tags doesn't matter if there is no text. – atif Aug 27 '14 at 16:57
  • So remove trailing whitespace except when the entire text is whitespaces? – ikegami Aug 27 '14 at 16:58
  • Yes, you could say that. – atif Aug 27 '14 at 16:59
  • This smacks of [parsing XML the Cthulhu way](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – tripleee Aug 27 '14 at 17:00
  • @Avinash Raj, breaks comments and CDATA, at least. – ikegami Aug 27 '14 at 17:03
  • @AvinashRaj regex101.com/r/vY0rD6/5 is ignoring the xml opening and closing tag, so it will still remove white space if tag is not present... – atif Aug 27 '14 at 17:07
  • it's better to go with parsers or something but not regex.... – Avinash Raj Aug 27 '14 at 17:09
  • @ikegami Unfortunately I can't use a parser. Can I try something like this: \s*(?:\n\n+)\s*<[A-Z|a-z]+>? The problem here is that the XML tag itself is replaced. Can we use grouping so that only the whitespace is replaced and the tag is kept as is? – atif Aug 27 '14 at 17:20
  • @AvinashRaj Unfortunately I can't use a parser. Can I try something like this: \s*(?:\n\n+)\s*<[A-Z|a-z]+>? The problem here is that the XML tag itself is replaced. Can we use grouping so that only the whitespace is replaced and the tag is kept as is? – atif Aug 27 '14 at 17:24
  • @atif, You have to use a parser. Parsing is to assign meaning to tokens. You can try all you want, but that doesn't check if `<` is the start of a tag. – ikegami Aug 27 '14 at 17:28
  • e.g., fails for ` <![CDATA[ ]]>` in two ways. – ikegami Aug 27 '14 at 17:34
  • @ikegami Guys, let's take a step back. Suppose it's not an XML file but just a string: "Definitions, Exemptions and Rebates where |E|. this is a string". I want to remove whitespace before any occurrence of |E|, where "E" can be any number of letters surrounded by |. How do we do that? – atif Aug 27 '14 at 17:56
  • That's a completely different question. The original question asked "before a start tag", not "before a string". It's far more complicated because you can't search for a string to find a start tag. If you have a new question to ask, post it as a proper question. – ikegami Aug 27 '14 at 18:10

3 Answers


To determine whether < is the start of a tag, you'd have to find out whether it's in a comment, in a CDATA section, etc. You need more than a regex. I recommend using an existing parser.

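use strict;
use warnings;
use XML::LibXML qw( );

# Assumption: path of the file to reformat (the question calls it test.xml).
my $qfn = 'test.xml';

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($qfn);

for my $text_node ($doc->findnodes('//text()')) {
   my $text = $text_node->data();
   # Leave whitespace-only text (indentation between tags) untouched.
   next if $text =~ /^\s+\z/;

   # Only text that is followed by another node is of interest.
   my $next_node = $text_node->nextSibling();
   next if !$next_node;

   # Collapse the trailing whitespace into a single space.
   $text =~ s/\s+\z/ /;
   $text_node->setData($text);
}

# Write the result back over the original file.
$doc->toFile($qfn);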
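
To illustrate why a bare regex cannot tell whether < starts a tag, here is a minimal sketch in core Perl. It assumes a lookahead substitution of the kind discussed in this thread; the helper name `collapse_ws_before_lt_tag` and the sample string are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: applies a "whitespace before <word>" substitution
# blindly, with no knowledge of CDATA sections or comments.
sub collapse_ws_before_lt_tag {
    my ($s) = @_;
    $s =~ s/\s+(?=<\w+>)/ /g;
    return $s;
}

# The <b> inside CDATA and the <note> inside the comment are not tags,
# yet the whitespace in front of them is rewritten anyway.
my $xml = "<doc><![CDATA[keep   <b>literal</b>]]><!-- a\n\n<note> --></doc>";
print collapse_ws_before_lt_tag($xml), "\n";
# prints: <doc><![CDATA[keep <b>literal</b>]]><!-- a <note> --></doc>
```

The CDATA content and the comment are both changed, which is exactly the kind of damage a parser-based approach avoids.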
ikegami

I'm not a regex expert, so this will probably fail in some scenarios, but going by your last comment, try the following:

echo '<xml>
      <TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>

<TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>' | perl -0777 -pE 's/(\S)(\s+)(<\w+?>)/$1 $3/g;s/> +</>\n</g'

Output:

<xml>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>
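
For readers who want to see what the two substitutions do, here is a commented sketch of the same idea as a function. The name `collapse_before_tags` is hypothetical, and the capture groups are renumbered compared with the one-liner. Note that `<\w+>` only matches simple start tags: a tag with attributes, a namespace prefix, or a hyphen in its name is not matched.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the one-liner's two substitutions.
sub collapse_before_tags {
    my ($s) = @_;

    # Step 1: any whitespace run between a non-space character and a
    # simple start tag becomes a single space.
    $s =~ s{
        (\S)        # last non-whitespace character before the gap
        \s+         # the run of spaces and newlines to collapse
        (<\w+>)     # a start tag without attributes, e.g. <VARPARA>
    }{$1 $2}gx;

    # Step 2: step 1 also joined adjacent tags with a space;
    # put those back on separate lines.
    $s =~ s/> +</>\n</g;

    return $s;
}

print collapse_before_tags(qq{<TI>Rebates "where"  \n\n\n    <VARPARA>E</VARPARA></TI>}), "\n";
```

Because step 2 rewrites every `> <` pair, the original indentation between tags is lost, as the output above shows.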
clt60

This is how I handle it.

$s =~ s/\s+(?= \<\w+>)/ /xig;
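
The lookahead `(?= ... )` checks for the tag without consuming it, so only the whitespace is replaced. Under `/x` the literal space inside the lookahead is ignored. As written, the substitution also collapses whitespace that sits between two tags. A possible variant, sketched below, adds a lookbehind so that only whitespace following text is touched. The function name `collapse_after_text` is hypothetical, and the sketch assumes text never ends in a literal `>` and that start tags carry no attributes.

```perl
use strict;
use warnings;

# Hypothetical variant: collapse whitespace only when it follows text,
# leaving indentation between tags alone.
sub collapse_after_text {
    my ($s) = @_;
    $s =~ s{
        (?<= [^\s>] )   # preceded by a text character, not whitespace or '>'
        \s+             # the whitespace run to collapse
        (?= <\w+> )     # followed by a simple start tag, which is not consumed
    }{ }gx;
    return $s;
}

my $xml = qq{   <xml>\n      <TI>Rebates "where"  \n\n\n    <VARPARA><VAR>E</VAR></VARPARA></TI>\n   </xml>};
print collapse_after_text($xml), "\n";
```

Like any regex approach, this still cannot tell a real tag from `<` inside a comment or CDATA section.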

atif