
While doing a regex find-and-replace in a text file, I want to skip certain segments of the text. That is, certain parts of the text should be excluded from the search, and the search & replace should apply only to the remaining parts. The criteria are:

(1) anything between START and END should be excluded from the search & replace. START may or may not be at the start of a line; END may or may not be at the end of a line; one pair of START & END may span multiple lines;

(2) anything within an inline comment (after //) should be ignored; // may or may not be at the start of a line;

(3) the first word after . should be ignored; . may or may not be at the start of a line; the word may follow . immediately or be separated from it by spaces, newlines, or tabs.

Example code:

#!/usr/bin/env perl
use strict;
use warnings;

$/ = undef;

#iterate the DATA filehandle
while (<DATA>) {
    # This one replaces ALL occurrences of pattern.
    s/old/new/gs;

    # How do I skip the unwanted segments and do the replace?
    #print all
    print;
}

##inlined data filehandle for testing. 
__DATA__
xx START xx old xx END xx   --> ignore
xx old xx                   --> REPLACE !
START xx old                --> ignore
      xx old xx END         --> ignore
      xx old xx             --> REPLACE !
// xx old                   --> ignore
xx // xx old                --> ignore
xx . old old xx             --> ignore first one, replace second one
.
  old                       --> ignore
  (old) xx                  --> REPLACE !
xx old xx                   --> REPLACE !

Expected output is:

xx START xx old xx END xx   --> ignore
xx new xx                   --> REPLACE !
START xx old                --> ignore
      xx old xx END         --> ignore
      xx new xx             --> REPLACE !
// xx old                   --> ignore
xx // xx old                --> ignore
xx . old new xx             --> ignore first one, replace second one
.
  old                       --> ignore
  (new) xx                  --> REPLACE !
xx new xx                   --> REPLACE !

Can anyone help me with the regex here? I posted a similar question a couple of hours ago, but that post was full of ambiguities that precluded a clear answer. Hopefully this post is a "good" & "clear" question.

katyusza
  • I'll remove my answer. [Did you try it?](https://regex101.com/r/yI0jH2/3) – bobble bubble Feb 22 '16 at 07:56
  • I'm trying; have not found a solution yet. – katyusza Feb 22 '16 at 07:59
  • [`How do (*SKIP) or (*F) work on regex?`](http://stackoverflow.com/a/24535912/5527985) – bobble bubble Feb 22 '16 at 08:12
  • @bobblebubble They work perfect! Many thanks to your post (which you deleted); based on your code I solved my problem! – katyusza Feb 22 '16 at 08:37
  • @bobblebubble Your original post is like: `s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;`; I changed it to `s/(?:(?:START.*?END)|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;` (so that comments will terminate at end of line, which is \n) and it totally solved the problem! Thank you~ – katyusza Feb 22 '16 at 08:40
  • Also try `s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/g;` It probably failed because you used `/gs`: the `s` flag at the end makes every dot in the pattern match newlines. I only used the inline modifier `(?s:` for the part where it's needed. However, great that you got it going :) I restored the answer as it seemed to help solve the problem. – bobble bubble Feb 22 '16 at 08:46
  • @bobblebubble Yep, you're right. It's just either putting the `/s` flag inside or outside; glad I learned from you and @Jan the `(?s:...` syntax. Is it OK that I close this question with my own reply stating the final solution? – katyusza Feb 22 '16 at 08:51
  • This is a really good example of why a 'single regex' solution to problems is a bad idea. – Sobrique Feb 22 '16 at 10:50
  • @Sobrique Your replies have always been inspiring :-) Any further explanation on this? If a "single regex" does work, why is it bad? – katyusza Feb 22 '16 at 11:19
  • Imagine you come back to this code in 6 months' time and need to alter your regex. How much chance do you have of understanding it? – Sobrique Feb 22 '16 at 11:42
  • @Sobrique Got that ;-) I figure you're talking about the "What the hell is this!?" coding style. My solution is to use comments and documentation to explain these "tricky" parts; hopefully they'll give me a chance to understand it 6 months later :) – katyusza Feb 26 '16 at 03:39
  • Yes. But the thing is - you have a code snippet that is going to be very hard to understand, and thus _requires_ documentation. However, if you have written out the algorithm longhand in `perl` rather than using `regex` as a programming language, then you'd end up with something that didn't actually require that. That's why I really object to the "magic regex" solutions, because they're MUCH harder to maintain, for no real benefit aside from 'looking clever'. – Sobrique Feb 26 '16 at 10:32
  • @Sobrique Got that; learning it. Thank you~! – katyusza Mar 02 '16 at 03:03
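
One way to read Sobrique's suggestion is to spell the three rules out as separate steps. The sketch below is only an illustration, not code from the thread: `replace_longhand` is a hypothetical helper that walks the text with `\G` and `/gc`, copies protected segments unchanged, and replaces the word everywhere else. It assumes the search word is a non-empty literal string.

```perl
use strict;
use warnings;

# Hypothetical longhand version: scan left to right, copy protected
# segments as they are, and replace the word only in the remaining text.
sub replace_longhand {
    my ($text, $from, $to) = @_;
    my $out = '';
    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G(START.*?END)/gcs) {     # rule 1: START ... END, may span lines
            $out .= $1;
        }
        elsif ($text =~ m{\G(//[^\n]*)}gc) {     # rule 2: inline comment up to end of line
            $out .= $1;
        }
        elsif ($text =~ /\G(\.\s*\w+)/gc) {      # rule 3: first word after a dot
            $out .= $1;
        }
        elsif ($text =~ /\G\Q$from\E/gc) {       # unprotected occurrence: replace it
            $out .= $to;
        }
        elsif ($text =~ /\G(.)/gcs) {            # anything else: copy one character
            $out .= $1;
        }
    }
    return $out;
}

print replace_longhand("a old START old END old // old\n. old old\n", 'old', 'new');
# prints:
# a new START old END new // old
# . old new
```

Each rule sits on its own line with its own comment, so adding a fourth exclusion later means adding one more `elsif` rather than editing a dense alternation. The trade-off is that it is longer and scans the unprotected text one character at a time.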

3 Answers


You can use (*SKIP)(*F) verbs to skip something.

(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old

It works like this: (?:part 1 to skip|part 2 to skip|...)(*SKIP)(*F) | part to match

See demo at regex101
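
As a minimal sketch of that shape in Perl (the helper name `replace_outside_skipped` and the sample string are made up for illustration), the pattern can be written with the `/x` flag so that each skipped part gets its own comment:

```perl
use strict;
use warnings;

# Hypothetical helper: replace "old" with "new" except inside START...END,
# after //, or as the first word after a dot.
sub replace_outside_skipped {
    my ($text) = @_;
    $text =~ s{
        (?:
            (?s:START.*?END)   # part 1 to skip: START ... END, may span lines
          | //.*               # part 2 to skip: inline comment up to end of line
          | \.\s*\w+\b         # part 3 to skip: first word after a dot
        )
        (*SKIP)(*F)            # discard what was just matched and resume after it
      |
        old                    # part to match: the only alternative that can succeed
    }{new}gx;
    return $text;
}

print replace_outside_skipped("a old START old END old // old\n. old old\n");
# prints:
# a new START old END new // old
# . old new
```

When one of the skipped parts matches, `(*F)` forces a failure and `(*SKIP)` makes the engine resume after the text it just consumed, so an `old` inside that text is never offered to the last alternative.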

bobble bubble
  • Thanks; I'm checking it out. – katyusza Feb 22 '16 at 07:22
  • @katyusza I thought it's obvious, that you need to use `old` or whatever in the part to match, [see update](https://regex101.com/r/yI0jH2/3). – bobble bubble Feb 22 '16 at 07:38
  • Yes, either one of the following can do what I want: (1) `s/(?:(?:START.*?END)|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;` (2) `s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/g;` (3) `s/(?:START.*?END|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;` – katyusza Feb 22 '16 at 08:59
  • @katyusza [The dot default matches any character besides newline](http://www.regular-expressions.info/dot.html). So it would NOT skip over newlines by default. That the dot matches newlines was only needed for the `START.*?END` part which could span across multiple lines. That's why I used the inline [modifier](http://www.regular-expressions.info/modifiers.html) only for this part. – bobble bubble Feb 22 '16 at 09:10
  • In my intention, the newline may also appear between `.` and `pattern`. – katyusza Feb 22 '16 at 09:51
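
To make the difference between the trailing `/s` flag and the scoped `(?s:...)` concrete, here is a small sketch on made-up input (the variable names are only illustrative). It also shows that `\s` matches newlines with or without `/s`, which covers the case of a newline between `.` and the word:

```perl
use strict;
use warnings;

my $text = "old // old\nold\n";

# Trailing /s: the dot in //.* also matches newlines, so the comment
# alternative swallows everything up to the end of the string.
(my $with_flag = $text) =~ s{(?:START.*?END|//.*)(*SKIP)(*F)|old}{new}gs;

# Scoped (?s:...): only the START...END part lets the dot cross newlines,
# so //.* stops at the end of its own line.
(my $scoped = $text) =~ s{(?:(?s:START.*?END)|//.*)(*SKIP)(*F)|old}{new}g;

# \s matches newlines regardless of /s, so a word on the line after
# the dot is still treated as "the first word after ."
(my $after_dot = ".\n  old old\n") =~ s{\.\s*\w+\b(*SKIP)(*F)|old}{new}g;

print $with_flag;   # new // old, then old (third "old" left untouched)
print $scoped;      # new // old, then new
print $after_dot;   # ., then "  old new"
```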

You need to be more precise about your structure (i.e. when old should be ignored), but for your example the following regex will work (demo on regex101.com):

~                                       # delimiter
    (?:                                 # group all alternatives that must be skipped
    (?s)(?:START).*?(?:END)(?-s)|       # look for START-END in single-line mode OR
    //.+|                               # everything after two forward slashes OR
    \.\sold|                            # the word old after a dot and space OR
    ^\s+old                             # old after spaces at the beginning of the line
    )                                   # end of the group
    (*SKIP)(*FAIL)|                     # all these matches shall fail
    \b(old)\b                           # this one is to be kept
~xgm                                    # verbose, global and multiline modifiers

To read more about the concept, check this fantastic site - rexegg.com.

Jan
  • The demo exactly matches what I intended, but when I put it into `s///gs`, it issues an error: `syntax error at ... line .., near "+|"`, `Substitution pattern not terminated at ... line 10.` – katyusza Feb 22 '16 at 07:35
  • You need to use other delimiters: `~` instead of the forward slashes - otherwise, you need to escape the slashes like `\/\/`. – Jan Feb 22 '16 at 07:37
  • I escaped it using `\/\/`, now in the output file some stuffs are missing, which is not desired. I want full text to be printed. – katyusza Feb 22 '16 at 07:46
  • What do you mean by "I want full text to be printed" ? – Jan Feb 22 '16 at 07:48
  • I mean, I use `s/(?s)(?:START).*?(?:END)(?-s)|\/\/.+|\. old|^\s+old(*SKIP)(*FAIL)|\b(old)\b/new/gs;` in my code, and in the output result, first line becomes stuff like `xx new xx --> ignore`, the leading "xx START" and trailing "END xx" are gone. – katyusza Feb 22 '16 at 07:56
  • This is also the right answer :) Probably the key is to understand how `(*SKIP)(*F)` works. – bobble bubble Feb 22 '16 at 07:59
  • @bobblebubble Yep; I'm just googling (*SKIP)(*F) stuff. – katyusza Feb 22 '16 at 08:07
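
A likely explanation for the missing "xx START" and "END xx" reported in the comments is alternation precedence: in the one-line form, `(*SKIP)(*FAIL)` belongs only to the alternative directly before it (`^\s+old`), so the START...END alternative is an ordinary match and is replaced as a whole. The sketch below reproduces this on the first sample line and compares it with a grouped variant; it uses `~` as the delimiter so the slashes need no escaping. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "xx START xx old xx END xx\n";

# Ungrouped: only ^\s+old is followed by (*SKIP)(*FAIL); the START...END
# alternative succeeds normally and the whole block becomes "new".
(my $ungrouped = $line) =~ s~(?s)START.*?END(?-s)|//.+|\. old|^\s+old(*SKIP)(*FAIL)|\b(old)\b~new~g;

# Grouped: every alternative inside (?:...) runs into (*SKIP)(*FAIL),
# so the protected block is skipped instead of replaced.
(my $grouped = $line) =~ s~(?:(?s:START.*?END)|//.+|\. old|^\s+old)(*SKIP)(*FAIL)|\b(old)\b~new~gm;

print $ungrouped;   # xx new xx
print $grouped;     # xx START xx old xx END xx
```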

Thanks to the valuable contributions from @bobblebubble and @Jan, and based on the Perl code in their replies, I eventually learned to use (*SKIP)(*F) to skip, jump over or ignore unwanted segments. The final code is:

#!/usr/bin/env perl
use strict;
use warnings;

$/ = undef;

#iterate the DATA filehandle
while (<DATA>) {
    # This one replaces ALL occurrences of pattern.
    #s/old/new/gs;

    # How to skip the unwanted segments and do the replace:
    # Both of the following work.
    #s/(?:(?:START.*?END)|\/\/.*?\n|\.\s*\w+\b)(*SKIP)(*F)|old/new/gs;
    s/(?:(?s:START.*?END)|\/\/.*|\.\s*\w+\b)(*SKIP)(*F)|old/new/g;
    #print all
    print;
}

##inlined data filehandle for testing. 
__DATA__
xx START xx old xx END xx   --> ignore
xx old xx                   --> REPLACE !
START xx old                --> ignore
      xx old xx END         --> ignore
      xx old xx             --> REPLACE !
// xx old                   --> ignore
xx // xx old                --> ignore
xx . old old xx             --> ignore first one, replace second one
.
  old                       --> ignore
  (old) xx                  --> REPLACE !
xx old xx                   --> REPLACE !

And, again, many thanks to bobble bubble and Jan.

katyusza