
I am trying to use grep to select multi-line records from a file.

The records look something like this:

########## Ligand Number :       1
blab bla bla
bla blab bla


########## Ligand Number :       2
blab bla bla
bla blab bla


########## Ligand Number :       3
bla bla bla


<EOF>

I am using Perl RegEx (-P).

To work around grep's line-by-line matching, I use grep -zo. This way, the pattern can span multiple lines and the output is exactly what I want. Generally, it works fine.

However, the problem is that the delimiter here is two empty lines after the last line of each record, i.e. three consecutive '\n' characters: one ending the last line and two for the two empty lines.

When I try to use an expression like

    grep -Pzo '^########## Ligand Number :\s+\d+.+?\n\n\n' inputFile

it returns nothing. It seems that grep can't handle consecutive '\n' characters.

Can anybody give an explanation?

P.S. I have already worked around it by translating the '\n' characters to '\a' first and then translating them back, as in the following example:

    cat inputFile | tr '\n' '\a' | grep -Po '########## Ligand Number :\s+\d+\a.+?\a\a\a' | tr '\a' '\n'

But I need to understand why grep could not match the '\n\n\n' pattern.

Wiktor Stribiżew
Amr ALHOSSARY
  • Add `(?s)` at the start, or replace `.` with `[\s\S]`. In a PCRE regex, `.` does not match line break characters by default, and the `s` modifier enables the POSIX-like dot behavior. – Wiktor Stribiżew Sep 20 '17 at 06:19
  • @WiktorStribiżew Please read my question carefully till the end. I am clearly asking "why couldn't GREP understand the '\n\n\n' pattern"? – Amr ALHOSSARY Sep 20 '17 at 06:26
  • Computers cannot "understand" anything. Either the engine matches a string or not. A `.` in a PCRE regex does not match `\n`. – Wiktor Stribiżew Sep 20 '17 at 06:27

1 Answer


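In a PCRE regex, `.` does not match line break characters by default; the `s` modifier enables the POSIX-like behavior in which the dot matches them as well. In your pattern, `.+?` can therefore never cross the newline that follows the ligand number, so the match fails before the `\n\n\n` part is ever reached.

Thus, add `(?s)` at the start, or replace `.` with `[\s\S]`: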

    (?s)^########## Ligand Number :\s+\d+.+?\n\n\n
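The same effect can be reproduced outside grep. The following is a minimal sketch in Python, whose `re` module is not PCRE but shares the same default for `.` (it does not match `\n` unless DOTALL is enabled). The `text` sample is a hypothetical stand-in for the input file, built from the record layout shown in the question, and `re.MULTILINE` is assumed here so that `^` can anchor at the start of each record.

```python
import re

# Hypothetical sample mirroring the record layout from the question:
# each record ends with three consecutive "\n" characters.
text = (
    "########## Ligand Number :       1\n"
    "blab bla bla\n"
    "bla blab bla\n"
    "\n"
    "\n"
    "########## Ligand Number :       2\n"
    "blab bla bla\n"
    "bla blab bla\n"
    "\n"
    "\n"
    "########## Ligand Number :       3\n"
    "bla bla bla\n"
    "\n"
    "\n"
)

pattern = r"^########## Ligand Number :\s+\d+.+?\n\n\n"

# Default behaviour: "." cannot match the "\n" right after the number,
# so ".+?" fails and nothing is found.
no_dotall = re.findall(pattern, text, flags=re.MULTILINE)

# With the inline (?s) flag, "." also matches "\n", so the lazy ".+?"
# can run across lines until the first "\n\n\n".
with_dotall = re.findall("(?s)" + pattern, text, flags=re.MULTILINE)

# Equivalent alternative: replace "." with the class [\s\S].
with_class = re.findall(
    pattern.replace(".+?", r"[\s\S]+?"), text, flags=re.MULTILINE
)

print(len(no_dotall), len(with_dotall), len(with_class))
```

This also suggests why the `tr` workaround in the question succeeds: once every `\n` has been turned into `\a`, the dot has no newlines left to stumble over.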
Wiktor Stribiżew
  • You are right. The problem was not in parsing the '\n\n\n' pattern, but in expecting the internal '.' to match '\n'. – Amr ALHOSSARY Sep 20 '17 at 06:54