Perl multi line regex match and remove

Question

I have a text file that I am processing inside a Perl script. How do I collapse runs of two or more consecutive lines that contain only * into a single such line? Input:

some text
*
some text
*
*
some text
*
*
*

I want the text to look like this:

some text
*
some text
*
some text
*

ikegami · Accepted Answer · 2020-03-07T05:52:04.877

1

You could read the whole file in.

perl -0777pe's/^\*\n\K(\*\n)+//mg'

(The above won't work as written if the line terminator is missing from the last line.)

Working line by line is also rather simple since there's no need to look ahead.

perl -ne'print if !$flag || !/^\*$/; $flag = /^\*$/;'

See also: Specifying file to process to Perl one-liner

edited Mar 07 '20 at 05:52

answered Mar 07 '20 at 05:46

ikegami


Perfect. Thank you! – Oleg Bukhantsov Mar 07 '20 at 05:48

score 0 · Answer 2 · answered Mar 07 '20 at 07:34

You can slurp the file, that is, read it entirely into a string (computers today should have enough memory for that), and then replace every run of consecutive `*` lines with a single one.

use English;
my $contents = do { local $/; <> };
chomp $contents;
$contents .= $RS;
$contents =~ s/^(\*$RS)+/\1/mg;

The 'm' modifier makes the '^' anchor match at the beginning of every line, instead of only at the beginning of the entire string.

answered Mar 07 '20 at 07:34

Luca_65


`\1` is a regex pattern (that matches what the first capture matched). It makes no sense to use in string literals. You should be using `$1`, and Perl issues a warning saying as much. (You could also use the slightly faster `*\n`.) – ikegami Mar 08 '20 at 00:09
`use English; ...$RS...` is an obfuscated way of writing `...\n...` – ikegami Mar 08 '20 at 00:09
The use of \n works; I personally prefer $RS which is the input record separator in the current operating system eventually adapted to specific needs. Definition of $ from perlvar: ... Mnemonic: like \digits – Luca_65 Mar 08 '20 at 06:23
`$RS` is `"\n"` on all OSs. – ikegami Mar 08 '20 at 06:24
Forgot to mention that it's rather weird that you use `$RS` in some places and `$/` in others. Same var! – ikegami Mar 08 '20 at 06:27
To be specific: all magic of "record separator of the current operating system" is performed by the literal "\n" itself, or the `:crlf` layer if it should be translated from CRLF on that OS. `$RS` does not get you anything extra, except that it can be changed. – Grinnz Mar 09 '20 at 15:32

Polar Bear · Answer 3 · 2020-03-08T07:05:31.110

-1

Another possible one-liner solution is

perl -0777 -pe "s/(\n\*)+/$1/g" regex_stars.txt

Input

some text
*
some text
*
*
some text
*
*
*

Output

some text
*
some text
*
some text
*

edited Mar 08 '20 at 07:05

answered Mar 07 '20 at 06:10

Polar Bear


Fails for `"abc\n*\n*def\n"` and for `"*\n*\nxyz\n"` – ikegami Mar 08 '20 at 06:29
@ikegami - the original post has a different format than what you refer to. – Polar Bear Mar 08 '20 at 06:46
No edits have been made to the OP (except possibly in the first 5 minutes after posting, but you answered 30 minutes later). Either way, this answer doesn't answer the question as it stands. – ikegami Mar 08 '20 at 06:47
@ikegami - I have once more added an _Input_ block and an _Output_ block to my answer and rerun the command. I do not see any difference from the OP's input/output. – Polar Bear Mar 08 '20 at 07:07
Why are you spamming me with irrelevant comments? – ikegami Mar 08 '20 at 08:49
The point is that it produces the expected result with the data provided by the OP, and with yours if the first line is text followed by a double `*` as in the OP's example. The OP's question: _How do I remove 2 or more lines that contains only `*`_. – Polar Bear Mar 08 '20 at 08:52
@ikegami -- in OP's post no word that `*` followed by anything else your point 1 is not applicable, point 2 also is not applicable as OP's sample does not contain `*` on first line -- although it might be the case. I gave an idea but not solution for all possible cases [two `*` on same line, two `*` on same line separated by other symbols and so on]. – Polar Bear Mar 08 '20 at 08:56
You are mistaken. The OP clearly said the lines must contain *only* `*`. Read the passage you just quoted. So you are removing lines and you shouldn't, and you aren't removing lines that you should – ikegami Mar 08 '20 at 08:56
@ikegami -- where OP said that one of the line with `*` can be followed by any other symbol? Why did you bring this case into equation? Exactly lines must contain only `*` and nothing else (no spaces tabs or `def` as in your samples). – Polar Bear Mar 08 '20 at 08:59
Yes. Unfortunately, your solution doesn't handle that format. You should document your code's limitations. Because you didn't, I had to do it for you – ikegami Mar 08 '20 at 15:54
