How to ignore parts of the text and do search-and-replace in the remaining part?

Question

While doing a regex find-and-replace in a text file, I want to skip certain segments of the text. That is, certain parts of the text should be excluded from the search, and the search and replace should run only on the remaining parts. The criteria are:

(1) anything between START and END should be excluded from the search & replace. START may or may not be at the start of a line; END may or may not be at the end of a line; one pair of START & END may span multiple lines;

(2) anything within an inline comment (after //) should be ignored; // may or may not be at the start of a line;

(3) the first word after . should be ignored; . may or may not be at the start of a line; the word may immediately follow . or be separated from it by spaces, newlines, or tabs.

Example code:

#!/usr/bin/env perl
use strict;
use warnings;

$/ = undef;

#iterate the DATA filehandle
while (<DATA>) {
    # This one replaces ALL occurrences of pattern.
    s/old/new/gs;

    # How do I skip the unwanted segments and do the replace?
    #print all
    print;
}

##inlined data filehandle for testing. 
__DATA__
xx START xx old xx END xx   --> ignore
xx old xx                   --> REPLACE !
START xx old                --> ignore
      xx old xx END         --> ignore
      xx old xx             --> REPLACE !
// xx old                   --> ignore
xx // xx old                --> ignore
xx . old old xx             --> ignore first one, replace second one
.
  old                       --> ignore
  (old) xx                  --> REPLACE !
xx old xx                   --> REPLACE !

Expected output is:

xx START xx old xx END xx   --> ignore
xx new xx                   --> REPLACE !
START xx old                --> ignore
      xx old xx END         --> ignore
      xx new xx             --> REPLACE !
// xx old                   --> ignore
xx // xx old                --> ignore
xx . old new xx             --> ignore first one, replace second one
.
  old                       --> ignore
  (new) xx                  --> REPLACE !
xx new xx                   --> REPLACE !

Can anyone help me with the regex here? I posted a similar question a couple of hours ago, but that post was full of ambiguities that precluded a clear answer. Hopefully this one is a clearer question.

I'll remove my answer. [Did you try it?](https://regex101.com/r/yI0jH2/3) — bobble bubble, Feb 22 '16 at 07:56
[`How do (*SKIP) or (*F) work on regex?`](http://stackoverflow.com/a/24535912/5527985) — bobble bubble, Feb 22 '16 at 08:12
@bobblebubble They work perfect! Many thanks to your post (which you deleted); based on your code I solved my problem! — katyusza, Feb 22 '16 at 08:37
@bobblebubble Your original post is like: `s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;`; I changed it to `s/(?:(?:START.*?END)|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;` (so that comments will terminate at end of line, which is \n) and it totally solved the problem! Thank you~ — katyusza, Feb 22 '16 at 08:40
Also try `s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/g;` It probably failed because of the trailing `s` flag in `/gs`, which makes every dot in the pattern match newlines. I only used the inline modifier `(?s:` for the part where it's needed. However, great you got it going :) I restored the answer as it seemed to help solve the problem. — bobble bubble, Feb 22 '16 at 08:46
@bobblebubble Yep, you're right. It's just either putting the `/s` flag inside or outside; glad I learned from you and @Jan the `(?s:...` syntax. Is it OK that I close this question with my own reply stating the final solution? — katyusza, Feb 22 '16 at 08:51
This is a really good example of why a 'single regex' solution to problems is a bad idea. — Sobrique, Feb 22 '16 at 10:50
@Sobrique Your replies have always been inspiring :-) Any further explanation of this? If a "single regex" does work, why is it bad? — katyusza, Feb 22 '16 at 11:19
Imagine you come back to this code in 6 months' time and need to alter your regex. How much chance do you have of understanding it? — Sobrique, Feb 22 '16 at 11:42
@Sobrique Got that ;-) I figure you're talking about "What the hell is this!?" coding style. My solution is to use comments and documentations to document these "tricky" parts; hopefully they'll give me chance to understand it 6 months later :) — katyusza, Feb 26 '16 at 03:39
Yes. But the thing is - you have a code snippet that is going to be very hard to understand, and thus _requires_ documentation. However, if you have written out the algorithm longhand in `perl` rather than using `regex` as a programming language, then you'd end up with something that didn't actually require that. That's why I really object to the "magic regex" solutions, because they're MUCH harder to maintain, for no real benefit aside from 'looking clever'. — Sobrique, Feb 26 '16 at 10:32
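To make the "longhand" suggestion in the comments above concrete, here is a minimal sketch of one way it could look. It is not code from any commenter: the sub name `replace_outside_protected` and the variable names are made up for illustration, and it swaps the (*SKIP)(*F) trick for a plain `split` on the protected segments followed by an ordinary substitution on the pieces in between. It assumes the protected segments can be found left to right independently of the search text, which holds for the sample data in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: replace $from with $to everywhere except inside
# the three kinds of protected segments described in the question.
sub replace_outside_protected {
    my ($text, $from, $to) = @_;

    my $protected = qr{
        START .*? END      # rule 1: START ... END, possibly across lines
      | // [^\n]*          # rule 2: inline comment up to the end of the line
      | \. \s* \w+         # rule 3: first word after a dot
    }xs;

    # split with a capturing group keeps the protected pieces in the list:
    # even indices hold ordinary text, odd indices hold protected segments.
    my @pieces = split /($protected)/, $text, -1;

    for my $i (0 .. $#pieces) {
        next if $i % 2;                      # protected segment: leave untouched
        $pieces[$i] =~ s/\Q$from\E/$to/g;    # ordinary text: plain literal replace
    }
    return join '', @pieces;
}

print replace_outside_protected("old START old\nold END old // old\nold . old old\n", 'old', 'new');
```

Each rule sits on its own commented line and the replacement itself is a plain `s///`, so a new rule becomes one more alternative in `$protected` rather than a change to the control flow.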

score 2 · Accepted Answer · edited May 23 '17 at 11:45

You can use (*SKIP)(*F) verbs to skip something.

(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old

It works like this: (?:part 1 to skip|part 2 to skip|...)(*SKIP)(*F) | part to match

(?: opens a non-capturing group for the alternation
(?s: opens a group with the s flag, so the dot also matches newlines
\w matches a word character [A-Za-z0-9_]
\b matches a word boundary
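As a small illustration of the skip-then-match idea described above, the following sketch reduces it to a single skip branch. The sub name `replace_outside_markers` and the test string are made up for this example; only the `START.*?END`, `(*SKIP)(*F)` and `old` parts come from the answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: replace "old" with "new" except between START and END.
sub replace_outside_markers {
    my ($text) = @_;
    # The first alternative consumes a START ... END span; (*SKIP)(*F) then
    # rejects that match and resumes the search right after the span.
    # Only the second alternative, the bare "old", can produce a real match.
    $text =~ s/(?s:START.*?END)(*SKIP)(*F)|old/new/g;
    return $text;
}

print replace_outside_markers("old START old END old"), "\n";   # prints "new START old END new"
```

The skipped span is never part of a successful match, so it is copied to the output unchanged.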

See demo at regex101

edited May 23 '17 at 11:45 by Community · answered Feb 22 '16 at 07:15 by bobble bubble

Thanks; I'm checking it out. – katyusza Feb 22 '16 at 07:22
@katyusza I thought it's obvious, that you need to use `old` or whatever in the part to match, [see update](https://regex101.com/r/yI0jH2/3). – bobble bubble Feb 22 '16 at 07:38
Yes, either one of the following can do what I want: (1) `s/(?:(?:START.*?END)|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;` (2) `s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/g;` (3) `s/(?:START.*?END|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;` – katyusza Feb 22 '16 at 08:59
@katyusza [The dot default matches any character besides newline](http://www.regular-expressions.info/dot.html). So it would NOT skip over newlines by default. That the dot matches newlines was only needed for the `START.*?END` part which could span across multiple lines. That's why I used the inline [modifier](http://www.regular-expressions.info/modifiers.html) only for this part. – bobble bubble Feb 22 '16 at 09:10
In my intention, the newline may also appear between `.` and `pattern`. – katyusza Feb 22 '16 at 09:51
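The difference between a trailing `/s` and the inline `(?s:...)` modifier discussed in these comments can be seen in a short sketch. The two sub names and the sample string are assumptions made for this illustration; the patterns are cut-down versions of the ones quoted above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Trailing /s: every dot matches newlines, so the // branch runs to the
# end of the string instead of stopping at the end of the line.
sub with_global_s {
    my ($text) = @_;
    $text =~ s/(?:START.*?END|\/\/.*)(*SKIP)(*F)|old/new/gs;
    return $text;
}

# Inline (?s:...): only the START ... END branch lets the dot cross newlines.
sub with_scoped_s {
    my ($text) = @_;
    $text =~ s/(?:(?s:START.*?END)|\/\/.*)(*SKIP)(*F)|old/new/g;
    return $text;
}

my $sample = "START old\nold END old // old\nold\n";
print with_global_s($sample);   # the last "old" stays, swallowed by the comment branch
print with_scoped_s($sample);   # the last "old" becomes "new"
```

With the trailing `/s`, a comment branch written as `\/\/.*?\n` avoids the problem by stopping explicitly at the newline, which is what the first variant in the comment above does.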

score 2 · Answer 2 · Jan · 2016-02-22T11:34:59.533

You need to be more precise about your structure (i.e. when old should be ignored), but for your example the following regex will work (demo on regex101.com):

~                                       # delimiter
    (?:                                 # group every alternative that must be skipped
        (?s)START.*?END(?-s)|           # START ... END with the dot matching newlines OR
        //.+|                           # everything after two forward slashes OR
        \.\sold|                        # the word old after a dot and one whitespace character OR
        ^\s+old                         # old after whitespace at the beginning of a line
    )
    (*SKIP)(*FAIL)|                     # all these matches shall fail
    \b(old)\b                           # this one is to be kept
~xgm                                    # verbose, global and multiline modifiers

To read more about the concept, check this fantastic site - rexegg.com.

edited Feb 22 '16 at 11:34 · answered Feb 22 '16 at 07:22 by Jan

The demo exactly matches what I intended, but when I put it into `s///gs`, it issues an error: `syntax error at ... line .., near "+|"`, `Substitution pattern not terminated at ... line 10.` – katyusza Feb 22 '16 at 07:35
You need to use other delimiters: `~` instead of the forward slashes - otherwise, you need to escape the slashes like `\/\/`. – Jan Feb 22 '16 at 07:37
I escaped it using `\/\/`, now in the output file some stuffs are missing, which is not desired. I want full text to be printed. – katyusza Feb 22 '16 at 07:46
What do you mean by "I want full text to be printed" ? – Jan Feb 22 '16 at 07:48
I mean, I use `s/(?s)(?:START).*?(?:END)(?-s)|\/\/.+|\. old|^\s+old(*SKIP)(*FAIL)|\b(old)\b/new/gs;` in my code, and in the output result, first line becomes stuff like `xx new xx --> ignore`, the leading "xx START" and trailing "END xx" are gone. – katyusza Feb 22 '16 at 07:56
This is also the right answer :) Probably the key is to understand how `(*SKIP)(*F)` works. – bobble bubble Feb 22 '16 at 07:59
@bobblebubble Yep; I'm just googling (*SKIP)(*F) stuff >, – katyusza Feb 22 '16 at 08:07
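The missing "xx START" and "END xx" reported in the comments above are consistent with how alternation binds: in `A|B|C(*SKIP)(*FAIL)|D` the verbs belong only to the last alternative before them, so `A` and `B` are ordinary matches and get replaced. The sketch below contrasts an ungrouped and a grouped pattern; the sub names and the sample line are made up, and the patterns are simplified from the ones in this thread. It also uses `~` as the delimiter, as suggested above, so `//` needs no escaping.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Ungrouped: (*SKIP)(*FAIL) is attached only to the // branch, so a
# START ... END span is an ordinary match and is replaced as a whole.
sub replace_ungrouped {
    my ($text) = @_;
    $text =~ s~(?s:START.*?END)|//.+(*SKIP)(*FAIL)|\bold\b~new~g;
    return $text;
}

# Grouped: every skip alternative sits in front of (*SKIP)(*FAIL).
sub replace_grouped {
    my ($text) = @_;
    $text =~ s~(?:(?s:START.*?END)|//.+)(*SKIP)(*FAIL)|\bold\b~new~g;
    return $text;
}

my $line = "xx START xx old xx END xx old // old\n";
print replace_ungrouped($line);   # xx new xx new // old
print replace_grouped($line);     # xx START xx old xx END xx new // old
```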

score 0 · Answer 3 · answered Feb 22 '16 at 08:54

Thanks to the valuable contributions from @bobblebubble and @Jan, and based on the Perl code in their replies, I eventually learned to use (*SKIP)(*F) to skip, jump over or ignore unwanted segments. The final code is:

#!/usr/bin/env perl
use strict;
use warnings;

$/ = undef;

#iterate the DATA filehandle
while (<DATA>) {
    # This one replaces ALL occurrences of the pattern, including the ones to skip.
    #s/old/new/gs;

    # How to skip the unwanted segments and do the replace.
    # Both of the following work on the test data.
    #s/(?:(?:START.*?END)|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;
    s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/g;
    #print all
    print;
}

##inlined data filehandle for testing. 
__DATA__
xx START xx old xx END xx   --> ignore
xx old xx                   --> REPLACE !
START xx old                --> ignore
      xx old xx END         --> ignore
      xx old xx             --> REPLACE !
// xx old                   --> ignore
xx // xx old                --> ignore
xx . old old xx             --> ignore first one, replace second one
.
  old                       --> ignore
  (old) xx                  --> REPLACE !
xx old xx                   --> REPLACE !

And, again, many thanks to bobble bubble and Jan.
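For readers who want to keep the single-regex form but document it, one possible layout is the same substitution written with the `/x` flag, one rule per line. This is only a sketch: the sub name `replace_old` is made up, and the `s{...}{...}` delimiters are a choice made here so that `//` needs no escaping.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical wrapper around the final substitution, laid out with /x.
sub replace_old {
    my ($text) = @_;
    $text =~ s{
        (?:
            (?s: START .*? END )   # rule 1: START ... END, dot matches newlines here only
          | // .*                  # rule 2: inline comment up to the end of the line
          | \. \s* \w+ \b          # rule 3: first word after a dot
        )
        (*SKIP)(*F)                # reject what was matched above and move past it
      | old                        # the text that actually gets replaced
    }{new}gx;
    return $text;
}

print replace_old("xx . old old xx\n.\n  old\n  (old) xx\n");
```

Whitespace inside the pattern is ignored under `/x`, so the layout does not change what is matched.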
