Perl regex over multiple lines

Question

I have 2 input files.

$> cat file1.txt
! This is a comment in file1.txt
// Another comment and below line is an empty line

SIR 8 
    TDI(03)
    TDO(01)
    MASK(03);

and

$> cat file2.txt
! This is a comment in file2.txt
// Another comment and below line is an empty line

sir 8 tdi(03) tdo(01) mask(03);

Now, I'm trying to write a script that would harvest all those 'sir' lines. This is what I have:

while(<>) {
    # Skip over all lines that start with ! or are empty or start with //
    next unless !/^!/ and !/^\s*$/ and !/^\s*\/\//;

    # I'm using the modifier /i to be case insensitive
    if(/sir\s+\d+\s+tdi\(\d+\)\s+tdo\(\d+\)\s+mask\(\d+\)\s*;/i) {
        print $_;
    }
}

This now matches the statement in file2.txt, which is on a single line, but not the one in file1.txt, which spans multiple lines. I googled a lot and tried the suggested modifiers /m, /s and /g, but with no luck. Can you please help me find the right syntax?

ikegami · Answer 1 · 2019-11-15T16:33:02.160


You are reading in one line at a time and matching against that, so you can't possibly match something that spans more than one line.

It's easiest to read the whole file at once by undefining $/.

local $/;

while (<>) {
    while (/^sir\s+\d+\s+tdi\(\d+\)\s+tdo\(\d+\)\s+mask\(\d+\)\s*;/mig) {
        print "$&\n";
    }
}

The /m makes the ^ match the start of a line.

Replacing if (//) with while (//g) allows us to get all the matches.

As a one-liner,

perl -0777ne'CORE::say $& while /^SIR[^;]*;/mig'

See "Specifying file to process to Perl one-liner".
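
Since the goal stated in the comments is to extract the tdi/tdo/mask values rather than print the statement, the same pattern can carry capture groups. The following is a minimal sketch under that assumption, not part of the original answer: the helper name extract_sir and the hash keys (len for the number after SIR, plus tdi, tdo, mask) are made up for illustration, and the text is held in a string instead of being slurped from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: return one hash per SIR statement found in $text.
# The keys (len, tdi, tdo, mask) are illustrative names, not from the question.
sub extract_sir {
    my ($text) = @_;
    my @found;
    # /m lets ^ match at every line start, /i ignores case,
    # /g in the while condition walks through all matches.
    while ($text =~ /^sir\s+(\d+)\s+tdi\((\d+)\)\s+tdo\((\d+)\)\s+mask\((\d+)\)\s*;/mig) {
        push @found, { len => $1, tdi => $2, tdo => $3, mask => $4 };
    }
    return @found;
}

# Sample input modelled on file1.txt and file2.txt from the question.
my $text = <<'END';
! This is a comment
// Another comment and below line is an empty line

SIR 8
    TDI(03)
    TDO(01)
    MASK(03);
sir 8 tdi(03) tdo(01) mask(03);
END

for my $sir (extract_sir($text)) {
    print join(' ', map { "$_=$sir->{$_}" } qw(len tdi tdo mask)), "\n";
}
```

The captured values stay strings, so leading zeros such as 03 are preserved.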


This regex is shorter than the one you used: `SIR[^;]+;` – Federico Piazza Nov 15 '19 at 16:41
@Federico Piazza, Yes, but adding the `^` tells a lot to the reader. It also prevents commented out lines from matching, something the OP was interested in. It also requires something between `SIR` and `;`, which is weird. Finally, it also prevents `XSIR` from matching (but not `SIRX`). Use `^SIR\b` to prevent `SIRX` from matching. In short, this isn't a golfing contest. – ikegami Nov 15 '19 at 16:48
@ikegami Thanks a lot! The `next unless` line no longer works (it now removes some SIR lines) but that's ok and I haven't tried to fix it. But there is another case which is still not working. `SIR 8\n TDI(0000\n0000\n0000)\n...`, basically when TDI() goes over multiple lines. I've tried so many things the last days but didn't find an elegant way. I would appreciate if you could help me on this one as well. Thanks, Amir – Amir Nov 21 '19 at 15:27
If the paren is still open, read more lines until it's closed. – ikegami Nov 21 '19 at 16:14
You mean with additional `if()` statements? I was hoping to have something within the `while(/^sir.../)` or a magic modifier to achieve this. – Amir Nov 21 '19 at 16:46
Honestly, I'd just load the entire file into memory, and fix up the lines with two s///g, one for sir, one for tdi. – ikegami Nov 21 '19 at 17:15
Well, there are lots of what we call in Germany "trick 17"s to do this and I have already one way implemented as a workaround. I was rather interested to see if there is a better way to do this in a single regex that would match all possible ways. I have faced this issue multiple times already and looking for a better way. Still, thanks a lot for your support! – Amir Nov 21 '19 at 22:31
Re single regex, You are trying to do two very different things. – ikegami Nov 22 '19 at 17:18
I don't think so. I want to read in a special pattern `sir 8 tdi(03) tdo(01) mask(03);` in all possible ways it could occur, that's just a single thing I'm trying to do – Amir Nov 23 '19 at 13:16
But in one case, you wanted to replace `\nSIR\s+` with a single space. In the other, a `\n` with a single space. The conditions under which you wanted to do those things are also different – ikegami Nov 23 '19 at 14:28
That's a misunderstanding then, sorry. I don't want to replace anything. All I want is to read the sir pattern and extract the information from tdi/tdo/mask. And this instruction comes in multiple flavors (single line, multiple lines, small/capital letters, ...). – Amir Nov 24 '19 at 15:48
That doesn't make a difference. In one case, you want to treat `\nSIR\s+` as horizontal whitespace. In another, `\n` but not `\nSIR\s+`. In others yet, neither `\nSIR\s+` nor `\n` should be considered horizontal whitespace. These situations must therefore be treated differently. – ikegami Nov 24 '19 at 16:13
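
For the case raised in the comments, where the value inside TDI(...) itself wraps across lines, one possible two-step approach is sketched below. It is neither the single regex Amir asked for nor exactly the two s///g substitutions ikegami suggested: it first grabs each statement up to its ";", then removes whitespace inside the parentheses before applying the question's pattern with captures. The names parse_sir and strip_ws and the hash keys are hypothetical, and the values are assumed to consist of decimal digits only, as in the question's regex.

```perl
use strict;
use warnings;

# Hypothetical helper: remove all whitespace (including newlines) from a string.
sub strip_ws {
    my ($s) = @_;
    $s =~ s/\s+//g;
    return $s;
}

# Hypothetical helper: parse SIR statements whose values may wrap across lines.
sub parse_sir {
    my ($text) = @_;
    my @found;
    # Step 1: grab each statement, from a line-initial "sir" to the next ";".
    while ($text =~ /^sir\b[^;]*;/mig) {
        my $stmt = $&;
        # Step 2: join values that were split over several lines.
        $stmt =~ s{\(([^)]*)\)}{ '(' . strip_ws($1) . ')' }ge;
        # Step 3: the question's pattern, with capture groups added.
        if ($stmt =~ /^sir\s+(\d+)\s+tdi\((\d+)\)\s+tdo\((\d+)\)\s+mask\((\d+)\)\s*;/i) {
            push @found, { len => $1, tdi => $2, tdo => $3, mask => $4 };
        }
    }
    return @found;
}

# Sample input modelled on the example in the comments.
my $text = <<'END';
SIR 8
    TDI(0000
0000
0000)
    TDO(01)
    MASK(03);
END

for my $sir (parse_sir($text)) {
    print join(' ', map { "$_=$sir->{$_}" } qw(len tdi tdo mask)), "\n";
}
```

Step 1 assumes that no other ";" appears between a line-initial "sir" and the end of its statement; a comment line containing ";" in the middle of a statement would cut the match short.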
