5

I need to write a Perl script to read in a file, and delete anything inside < >, even if they're on different lines. That is, if the input is:

Hello, world. I <enjoy eating
bagels. They are quite tasty.
I prefer when I ate a bagel to
when I >ate a sandwich. <I also
like >bananas.

I want the output to be:

Hello, world. I ate a sandwich. bananas.

I know how to do this with a regex if the text is on one line, but I don't know how to do it across multiple lines. Ultimately I need to be able to conditionally delete parts of a template so I can generate parameterized config files. I thought Perl would be a good language for this, but I am still getting the hang of it.

Edit: It also needs to handle more than one instance of <>.

brian d foy
rlbond

4 Answers

6

You may want to check out the Perl module Text::Balanced, which is part of the core distribution; I think it will help you. Generally, you want to avoid regexes for this sort of thing if the subject text is likely to contain an inner (nested) set of delimiters, because that can get very messy.
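
As a rough illustration of that suggestion, here is a minimal sketch that uses extract_bracketed from Text::Balanced to drop balanced <...> groups, including nested ones. The helper name strip_balanced is hypothetical, and the choice to leave text with unbalanced brackets untouched is an assumption, not something the question specifies.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: remove every balanced <...> group, nested ones included.
sub strip_balanced {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        # Third argument is a prefix pattern: skip everything up to the next '<'.
        my ($group, $rest, $prefix) = extract_bracketed($text, '<>', '[^<]*');
        if (!defined $group || $group eq '') {
            # No balanced group found (assumption: keep the rest unchanged).
            $out .= $text;
            last;
        }
        $out .= $prefix;    # keep the text before the group, drop the group
        $text = $rest;      # continue with whatever follows the group
    }
    return $out;
}
```

Calling strip_balanced on a slurped file would then treat an outer group that contains an inner <...> pair as a single unit, which is the case a simple regex handles poorly.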

Danny
6

In Perl:

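#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole input; a plain scalar <> would read only the first line.
my $text = do { local $/; <> };
$text =~ s/<[^>]*>//g;
print $text;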

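The substitution deletes everything from a < through the next > (inclusive). Because the negated class [^>] also matches newlines, a span can cross line boundaries once the whole file is in $text. The /g flag makes the substitution global, so every such span is removed, not just the first.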
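
To make the behavior concrete, here is a small sketch that applies the same kind of substitution to the sample text from the question. The helper name strip_angle_spans is hypothetical, and the character class [^<>] follows Alan Moore's comment rather than the answer's original [^>].

```perl
use strict;
use warnings;

# Hypothetical helper: strip every <...> span from a string that may contain newlines.
sub strip_angle_spans {
    my ($text) = @_;
    # [^<>] matches any character except the two brackets, newlines included,
    # so each match runs from a '<' to the nearest following '>'.
    $text =~ s/<[^<>]*>//g;
    return $text;
}

# The sample input from the question, held in one string.
my $sample = join '',
    "Hello, world. I <enjoy eating\n",
    "bagels. They are quite tasty.\n",
    "I prefer when I ate a bagel to\n",
    "when I >ate a sandwich. <I also\n",
    "like >bananas.\n";

print strip_angle_spans($sample);   # prints: Hello, world. I ate a sandwich. bananas.
```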

EDIT: incorporated comments from Hynek and chaos

brian d foy
OtherDevOpsGene
  • It's a little inefficient to split it and join it again. Try perl -0777 -pe 's/<[^>]*>//gm' – Hynek -Pichi- Vychodil Apr 10 '09 at 14:38
  • The /m modifier isn't helping. It means 'treat as multiline', i.e. let ^ and $ match at embedded newlines, not 'this input is multiline'. /s ('treat as single line', so that . matches newlines) is closer to what you'd want, but you don't need it either, because your pattern doesn't use . at all. – chaos Apr 10 '09 at 14:46
  • 1
    I would put both angle brackets in the negated character class: s/<[^<>]*>//g. Otherwise, a match could start at one < and run across a second < before reaching the >, which probably isn't what you want. – Alan Moore Apr 10 '09 at 18:16
  • Very useful. Chaos's answer, however, is more adaptable to multi-character delimiters, i.e. using . and /s rather than [^(delimiter)]. +1 for great advice though. – rlbond Apr 10 '09 at 20:56
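
The point about multi-character delimiters can be sketched as follows. The helper name strip_delimited and the example delimiters [% and %] are assumptions made for illustration; nothing in the thread specifies them.

```perl
use strict;
use warnings;

# Hypothetical helper: the caller supplies the opening and closing delimiter strings.
sub strip_delimited {
    my ($text, $open, $close) = @_;
    # \Q...\E quotes any regex metacharacters inside the delimiters.
    # .*? is non-greedy, so each match stops at the first closing delimiter,
    # and /s lets . match newlines so a span can cross lines.
    $text =~ s/\Q$open\E.*?\Q$close\E//gs;
    return $text;
}
```

For example, strip_delimited($text, '[%', '%]') would remove every [% ... %] span. A negated character class cannot express "anything up to a two-character closer" this directly, which is why the non-greedy form adapts better here.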
4
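local $/;                 # undefine the record separator to slurp the whole input
my $text = <>;
$text =~ s/<.*?>//gs;     # /s lets . match newlines; .*? stops at the first >
print $text;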
chaos
  • If your string has nested brackets, with an inner <...> pair followed by ghi>, your regex leaves 'ghi>'. If nested or escaped brackets and other perverse cases "never happen", the regex is fine. To handle the perverse cases, use Text::Balanced, even though the interface is weird. – daotoad Apr 10 '09 at 16:26
1

An inefficient one-liner (it reads the whole input into memory):

perl -0777 -pe 's/<.*?>//gs'

The same as a program:

local $/;                 # slurp mode: read the whole input at once
my $text = <>;
$text =~ s/<.*?>//gs;     # apply the substitution to $text, not $_
print $text;

Which approach is better depends on how large the text is. Here is a more memory-efficient one-liner that processes the input line by line:

perl -pe 'if ($a) {(s/.*?>// and do {s/<.*?>//g; $a = s/<.*//s;1}) or $_=q{}} else {s/<.*?>//g; $a = s/<.*//s}'

The same as a program:

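my $a;    # true while we are inside a <...> span left open by an earlier line
while (<>) {
    if ($a) {
        if (s/.*?>//) {           # the open span closes on this line
            s/<.*?>//g;           # remove spans that are complete on this line
            $a = s/<.*//s;        # cut a newly opened span, newline included
        }
        else { $_ = q{} }         # still inside the span: drop the whole line
    }
    else {
        s/<.*?>//g;               # remove spans that are complete on this line
        $a = s/<.*//s;            # cut an unclosed span and remember that it is open
    }
    print;
}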
Hynek -Pichi- Vychodil