(X)HTML/XML shouldn't be parsed with regex. But since no description of the problem is given, here is one way to go about it. Hopefully it demonstrates how tricky and involved this can get.
You can match the newline itself, along with the details of how linefeeds may appear in the text.
use warnings;
use strict;
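my $text = do {   # read all text into one string
    local $/;     # undefine the input record separator, to slurp
    <DATA>;
};
# Repeat the substitution until no newline is left inside any <...>
1 while $text =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;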
print $text;
__DATA__
start < inside tags> no new line
again <inside, with one nl
> out
more <inside, with two NLs
and more text
>
This prints
start < inside tags> no new line
again <inside, with one nl > out
more <inside, with two NLs and more text >
The negated character class [^>] matches anything other than >, optionally and any number of times with *, up to an \n. Then another such pattern follows the \n, up to the closing >. The /x modifier allows spaces inside the pattern, for readability. We also need to consider two particular cases.
There may be multiple \n inside one <...>. A single pass removes only one of them per tag, since the match consumes the whole tag, so the while loop is a clean solution.
There may be multiple <...> with \n, which is what /g is for.
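To see both cases at work, here is a small sketch that runs the same substitution once and then to completion. The sample string and the variable names $once and $all are made up for illustration.

```perl
use warnings;
use strict;

# Made-up sample: the first tag holds two newlines, the second holds one
my $text = "a <one\ntwo\nthree> b <four\nfive> c\n";

# One pass with /g: every tag is visited, but only one newline per tag is replaced
(my $once = $text) =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;

# Repeat until nothing matches: all newlines inside all tags are replaced
my $all = $text;
1 while $all =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;

print "one pass: $once";
print "looped:   $all";
```

After the single pass the first tag still has a newline in it, because [^>]* is greedy and the one match per tag replaces only the last newline. The looped version clears them all.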
The 1 while ... idiom is another way to write while (...) { }, where the body of the loop is empty so everything happens in the condition, which is evaluated repeatedly until it is false. In our case the substitution keeps running in the condition until there is no match, at which point the loop exits.
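For comparison, here is a sketch of the same loop written out with an explicit body. The counter $passes and the input string are made up for illustration, and the body exists only to count how many times the condition succeeded.

```perl
use warnings;
use strict;

my $text = "x <a\nb\nc\nd> y\n";   # made-up input: three newlines in one tag

my $passes = 0;
# Equivalent to:  1 while $text =~ s/.../.../gx;  but with a body that counts.
# s///g returns the number of substitutions made, or a false value if none,
# and that false value is what ends the loop.
while ( $text =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx ) {
    ++$passes;
}

print "passes: $passes\n";
print $text;
```

With three newlines in the one tag, the substitution succeeds on three passes and fails on the fourth, which is the check that ends the loop.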
Thanks to ysth for bringing up these points and for the 1 while ... solution.
All of this necessary care for various details and edge cases (of which there may be more) hopefully convinces you that it is better to reach for an HTML parsing module suitable for the particular task. For this we'd need to know more about the problem.