I finally know how to use regular expressions to replace one substring with another every place where it occurs within a string. But what I need to do now is a bit more complicated than that.

A string I must transform will have many instances of the newline character ('\n'). If a newline character is enclosed within fish-tags (between '<' and '>'), I need to replace it with a single space character (' ').

However, if a newline character occurs anywhere else in the string, I need to leave that newline character alone.

There will be several places in the string that are enclosed in fish-tags, and several places that aren't.

Is there a way to do this in Perl?

Sophia_ES
  • Sounds like HTML? Do you have example input? – dawg Jul 13 '16 at 17:25
  • Yes ---- I am talking about HTML --- but preparing an example may take a while, as this question is to allow a script to accommodate a situation that may occur at any moment in the future, not one that already has. – Sophia_ES Jul 13 '16 at 17:40
  • Actually --- the kinds of files I'm talking about are text files that incorporate HTML tags ---- text files that I will later copy/paste into a WordPress blog. I'm writing a script that will take the file and generate a stylized HTML preview. – Sophia_ES Jul 13 '16 at 17:46
  • This may be an AB issue. If you are trying to extract and sanitize chunks of HTML you want to use an HTML parser that can handle HTML fragments. I really like HTML::TreeBuilder from the HTML-Tree distribution for this sort of task. https://metacpan.org/release/HTML-Tree – daotoad Jul 13 '16 at 20:32
  • "I've just learnt to use regular expressions, so now I want to parse HTML". Don't *do* that. Use an XML or HTML parser for it instead. – Kusalananda Jul 14 '16 at 08:38

4 Answers

I honestly don't recommend doing this with regular expressions. Besides the fact that you should never parse HTML with a regular expression, negative matches are a pain to write as regular expressions, and anyone reading the code will have no idea what you just did. Doing it manually, on the other hand, is easy to understand.

This code assumes well-formed HTML in which no tag starts inside the definition of another tag (otherwise you would have to track all the instances and increment/decrement a count appropriately). It also does not handle < or > inside quoted strings, which is not very common. If you need to handle all of that, I really recommend using a real HTML parser; there are many of them.

If you're not reading this from a filehandle, the loop would instead go over an array of lines, or over the result of splitting the whole text on newlines. Splitting removes the newlines, so in that case, rather than replacing the last character, you would append ' ' or "\n" to each line depending on the $inside variable.

use strict;
use warnings;

# Default to being outside a tag
my $inside = 0;

while(my $line = <DATA>) {
  # Find the last < and > in the string
  my ($open, $close) = map { rindex($line, $_) } qw(< >);
  # Update our state accordingly.
  if ($open > $close) {
    $inside = 1;
  } elsif ($open < $close) {
    $inside = 0;
  }
  # If we're inside a tag, replace the newline (the last character of the line)
  # with a space. To remove it instead, use the built-in chomp.
  if ($inside) {
    # chomp($line);
    substr($line, -1) = ' ';
  }
  print $line;
}

__DATA__
This is some text
and some more
<enclosed><a
 b
 c
> <d
 e
 f
>
<g h i



>
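
The split-based variant mentioned above could be sketched roughly as follows. This is only an illustration under the same assumptions as the loop (well-formed tags, no < or > inside quoted strings); the function name join_newlines_in_tags and the sample input are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: the same state tracking as the filehandle loop,
# applied to a string that is already in memory.
sub join_newlines_in_tags {
    my ($text) = @_;
    my $inside = 0;    # default to being outside a tag
    my $out    = '';

    # The -1 limit keeps trailing empty fields, so a final newline survives.
    my @lines = split /\n/, $text, -1;

    for my $i (0 .. $#lines) {
        my $line = $lines[$i];

        # Find the last < and > in the line and update the state.
        my ($open, $close) = map { rindex($line, $_) } qw(< >);
        if    ($open > $close) { $inside = 1 }
        elsif ($open < $close) { $inside = 0 }

        $out .= $line;

        # split removed the newlines, so put a separator back between lines:
        # a space inside a tag, a newline outside.
        $out .= ($inside ? ' ' : "\n") if $i < $#lines;
    }
    return $out;
}

print join_newlines_in_tags("text\n<a\n b\n> more\n<c>\n");
```

The separator is appended only between lines, so the function does not add a newline that the input did not have.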
Tim Tom

(X)HTML/XML shouldn't be parsed with regex. But since no details of the problem are given, here is one way to go about it. Hopefully it demonstrates how tricky and involved this can get.

You can match the newline itself, along with handling of the ways in which newlines may occur in the text.

use warnings;
use strict;

my $text = do {  # read all text into one string
   local $/;
   <DATA>;
};

1 while $text =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;

print $text;

__DATA__
start < inside tags> no new line
again <inside, with one nl  
> out
more <inside, with two NLs

     and more text
>

This prints

start < inside tags> no new line
again <inside, with one nl   > out
more <inside, with two NLs      and more text  >

The negated character class [^>] matches any character other than >. With the * quantifier it matches zero or more such characters, up to an \n. Then another such pattern follows the \n, up to the closing >. The /x modifier allows spaces inside the pattern, for readability. We also need to consider two particular cases.

  • There may be multiple \n inside <...>, for which the while loop is a clean solution.

  • There may be multiple <...> with \n, which is what /g is for.

The 1 while ... idiom is another way to write while (...) { }, where the body of the loop is empty so everything happens in the condition, which is repeatedly evaluated until false. In our case the substitution keeps being done in the condition until there is no match, when the loop exits.
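As a small illustration of why the loop is needed in addition to /g, here is a sketch that compares a single pass with the repeated one. The sample string is made up for the example; the substitution is the one used above.

```perl
use strict;
use warnings;

# Hypothetical sample: one tag that contains two newlines.
my $sample = "x <a\nb\nc> y\n";

# A single pass with /g: the match consumes the whole tag, so only one
# of the newlines inside it gets replaced.
(my $once = $sample) =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;

# Repeating the substitution until it no longer matches replaces the rest.
my $all = $sample;
1 while $all =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;

print $once;
print $all;
```

After the single pass the tag still contains a newline; after the loop it contains none, and the newline outside the tag is untouched in both cases.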

Thanks to ysth for bringing up these points and for the 1 while ... solution.

All of this necessary care for various details and edge cases (of which there may be more) hopefully convinces you that it is better to reach for an HTML parsing module suitable for the particular task. For this we'd need to know more about the problem.

zdim
  • it didn't sound like the \n would be right before the `>`, so remove `>` from both parts of the substitution (and add the requested space character). also, if there are \n in multiple tags, you want /g, and if there are multiple \n in a single tag, you want `1 while $string =~ ...` – ysth Jul 13 '16 at 17:46
  • @ysth Thank you -- yes, perils of mixing regex and HTML ... can't follow up now so will remove this until I can (unless someone else takes care of it :). Thank you. – zdim Jul 13 '16 at 17:51

Given:

$ echo "$txt"
Line 1
Line 2
    < fish tag line 1
 and line 2 >

< line 3 >

    < fish tag line 4
 and line 5 >

You can do:

$ echo "$txt" | perl -0777 -lpe "s/(<[^\n>]*)\n+([^>]*>)/\1\2/g"
Line 1
Line 2
    < fish tag line 1 and line 2 >

< line 3 >

    < fish tag line 4 and line 5 >

I will echo that this only works in limited cases. Please do not get in the general habit of using a regex for HTML.

dawg
  • It is bad practice to use `\1`, `\2` etc. in the replacement part of a substitution, as they normally insert characters with codes 1, 2 etc. A warning `\1 better written as $1` is issued if `use warnings 'syntax'` is enabled, which it should be. – Borodin Jul 14 '16 at 08:22

This solution uses zdim's data (thanks, zdim)

I prefer to use an executable replacement (the /e modifier) together with the non-destructive /r option of the tr/// operator.

This solution finds all occurrences of strings enclosed in angle brackets <...> and changes every newline within each one to a single space.

Note that it would be simple to allow for quoted substrings containing any characters by replacing the substitution in the program below with this one:

$data =~ s{ ( < (?: "[^"]+" | [^>] )+ > ) }{ $1 =~ tr/\n/ /r }gex;

use strict;
use warnings 'all';
use v5.14;  # For /r option

my $data = do {
    local $/;
    <DATA>;
};

$data =~ s{ ( < [^<>]+ > ) }{ $1 =~ tr/\n/ /r }gex;

print $data;


__DATA__
start < inside tags> no new line
again <inside, with one nl  
> out
more <inside, with two NLs

     and more text
>

output

start < inside tags> no new line
again <inside, with one nl       > out
more <inside, with two NLs           and more text     >
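
To make the difference between the two substitutions concrete, here is a rough comparison on a made-up tag whose quoted attribute value contains a >. The input string and variable names are assumptions for illustration only; s///r (also available from Perl 5.14) is used so that the input is left unmodified.

```perl
use strict;
use warnings;
use v5.14;  # for the /r flag on s/// and tr///

# Hypothetical input: a tag whose quoted attribute value contains '>'.
my $input = qq{<a title="x > y"\nhref="z"> text\n};

# Simple pattern: the tag is taken to end at the first '>', which here
# sits inside the quotes, so the newline after it is never seen.
my $simple = $input =~ s{ ( < [^<>]+ > ) }{ $1 =~ tr/\n/ /r }gexr;

# Quote-aware pattern: a "..." run is consumed as a unit, so the match
# continues to the real end of the tag.
my $quoted = $input =~ s{ ( < (?: "[^"]+" | [^>] )+ > ) }{ $1 =~ tr/\n/ /r }gexr;

print $simple;
print $quoted;
```

On this input the simple pattern leaves the string unchanged, while the quote-aware pattern replaces the newline inside the tag and keeps the one after the text.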
Borodin