regex match for a string

Question

I am writing a Perl script that reformats an XML file. I need help handling the whitespace that precedes the opening of any XML tag. I have the following XML file:

test.xml

   <xml>
      <TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>
   </xml>

I want a regular expression that replaces any whitespace (extra spaces and newline characters) before an opening XML tag with a single space. In the example above, <VARPARA> is the tag preceded by spaces and newlines after "where".

I was thinking of something along the lines of

$s =~ s/\s*</ </ig;

but this looks only at the opening <, whereas I want to match the whole opening tag, both the < and the closing >, for example

    <VARPARA>

The output string should look like below

    <xml>
      <TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
   </xml>

How come the spaces and the new lines are still present before `` in the desired output? — ikegami, Aug 27 '14 at 16:54
It should apply only when text is found before the XML tag. In my case, whitespace between tags doesn't matter if there is no text. — atif, Aug 27 '14 at 16:57
So remove trailing whitespace except when the entire text is whitespaces? — ikegami, Aug 27 '14 at 16:58
This smacks of [parsing XML the Cthulhu way](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — tripleee, Aug 27 '14 at 17:00
@AvinashRaj regex101.com/r/vY0rD6/5 is ignoring the xml opening and closing tag, so it will still remove white space if tag is not present... — atif, Aug 27 '14 at 17:07
it's better to go with parsers or something but not regex.... — Avinash Raj, Aug 27 '14 at 17:09
@ikegami unfortunately I can't use a parser, can i try some thing like this \s*(?:\n\n+)\s*<[A-Z|a-z]+>? the problem here is that xml tag itself is replaced. Can we use grouping in a way that it will only replace the white space and keep the tag as it? — atif, Aug 27 '14 at 17:20
@AvinashRaj unfortunately I can't use a parser, can i try some thing like this \s*(?:\n\n+)\s*<[A-Z|a-z]+>? the problem here is that xml tag itself is replaced. Can we use grouping in a way that it will only replace the white space and keep the tag as it? — atif, Aug 27 '14 at 17:24
@atif, You have to use a parser. Parsing is to assign meaning to tokens. You can try all you want, but that doesn't check if `<` is the start of a tag. — ikegami, Aug 27 '14 at 17:28
@ikegami Guys, let's take a step back and if I say it's not an xml file and it's just a string "Definitions, Exemptions and Rebates where |E|. this is a string". I want to remove white space before any occurrence of |E|. How do we do that? Where "E" can be any number of alphabets surrounded by || — atif, Aug 27 '14 at 17:56
That's a completely different question. The original question asked "before a start tag", not "before a string". It's far more complicated because you can't search for a string to find a start tag. If you have a new question to ask, post it as a proper question. — ikegami, Aug 27 '14 at 18:10

ikegami · Answer 1 · 2014-08-27T17:31:03.087

To determine whether a < is the start of a tag, you'd have to find out whether it's inside a comment, inside a CDATA section, etc. That takes more than a regex. I recommend using an existing parser.

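use strict;
use warnings;
use XML::LibXML qw( );

# Assumption: the file to rewrite in place is the test.xml from the question.
my $qfn = 'test.xml';

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($qfn);

for my $text_node ($doc->findnodes('//text()')) {
   my $text = $text_node->data();
   next if $text =~ /^\s+\z/;    # Leave whitespace-only text nodes alone.

   # Only text that is followed by another node can sit "before a tag".
   my $next_node = $text_node->nextSibling();
   next if !$next_node;

   $text =~ s/\s+\z/ /;          # Collapse trailing whitespace to one space.
   $text_node->setData($text);
}

$doc->toFile($qfn);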
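
To make the CDATA point concrete, here is a minimal sketch using only core Perl. The input string is a made-up example, not taken from the question, and the substitution is a simplified form of the regex approaches discussed in this thread. Inside the CDATA section, `<b>` is character data rather than a tag, but a tag-shaped pattern cannot know that.

```perl
use strict;
use warnings;

# Hypothetical input: the "<b>" inside the CDATA section is plain text,
# not an opening tag.
my $xml = '<doc><![CDATA[a   <b> c]]></doc>';

# A regex that looks for "whitespace followed by something tag-shaped"
# has no notion of CDATA, so it rewrites the protected text as well.
(my $mangled = $xml) =~ s/\s+(?=<\w+>)/ /g;

print "before: $xml\n";
print "after:  $mangled\n";
```

The three spaces after `a` are squeezed to one even though they are part of literal character data. The parser-based loop above does not have this problem, because it only ever sees text nodes that the parser has already identified.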

clt60 · Answer 2 · 2014-08-27T18:28:19.227

I'm not a regex expert, so this will probably fail in some scenarios, but going by your last comment, try the following:

echo '<xml>
      <TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>

<TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>' | perl -0777 -pE 's/(\S)(\s+)(<\w+?>)/$1 $3/g;s/> +</>\n</g'
<xml>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>
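
The one-liner above can also be written as a small script, which leaves room to comment each pass. This is only a sketch: the sub name `squeeze_before_tags` and the shortened sample string are illustrative, while the two substitutions are copied from the command line unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper; the two substitutions are the ones from the one-liner.
sub squeeze_before_tags {
    my ($xml) = @_;

    # Pass 1: a non-space character, a run of whitespace, then an opening
    # tag. The character ($1) and the tag ($3) are put back around a
    # single space.
    $xml =~ s/(\S)(\s+)(<\w+?>)/$1 $3/g;

    # Pass 2: pass 1 also joined adjacent tags with a space, so split
    # "> <" back onto separate lines.
    $xml =~ s/> +</>\n</g;

    return $xml;
}

# Shortened sample modelled on the question's test.xml.
my $sample = qq{<xml>\n  <TI>Rebates "where"  \n\n\n    <VARPARA><VAR>E</VAR></VARPARA></TI>\n</xml>\n};
print squeeze_before_tags($sample);
```

Here the whole input is already in one string, which is what `-0777` arranges on the command line by reading the file in a single chunk. As in the one-liner's output, the indentation in front of `<TI>` is lost, because pass 2 puts back only a newline.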

score 0 · Accepted Answer · answered Aug 27 '14 at 22:29

This is how I handle it.

$s =~ s/\s+(?= \<\w+>)/ /xig;
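
A short sketch of what this substitution does on a sample modelled on the question's test.xml. The lookahead `(?= ...)` checks that an opening tag follows without consuming it, so only the whitespace is replaced. The second pattern is a hypothetical variant, not part of the answer: it adds a lookbehind so that only whitespace preceded by text (a character that is neither whitespace nor `>`) is touched. It shares every limitation of regex-based XML handling raised in the comments.

```perl
use strict;
use warnings;

# Sample input modelled on test.xml from the question.
my $input = join "\n",
    '<xml>',
    '  <TI>Definitions, Exemptions and Rebates "where"  ',
    '',
    '',
    '    <VARPARA><VAR>E</VAR></VARPARA></TI>',
    '</xml>',
    '';

# The substitution from the answer: a whitespace run is replaced only when
# the lookahead sees an opening tag next; the tag itself is not consumed.
(my $collapsed = $input) =~ s/\s+(?= \<\w+>)/ /xig;

# Hypothetical variant (an assumption, not from the answer): the lookbehind
# also requires a non-whitespace, non-">" character before the run, so
# indentation between two tags is left alone.
(my $text_only = $input) =~ s/(?<= [^\s>] ) \s+ (?= <\w+> )/ /xg;

print $collapsed, "---\n", $text_only;
```

With the answer's regex, the newline and indentation between `<xml>` and `<TI>` are collapsed to a single space as well. The variant keeps them and only joins `"where"` to `<VARPARA>`, which is closer to the output requested in the question.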

atif
