1

I rarely write Perl and don't know how to phrase the question. I am using Perl as a "filter" to go through files.

echo "this is a test" | perl -pe 's/(this).*(test)?/\1 \2/'

returns only

this

I am looking for

this test
Grinnz
user121392

2 Answers

5

Regexes (the pattern in the first part of the `s///` operator) match against the text from left to right, and quantifiers are greedy by default. So the regex first finds `this` (easy enough), then `.*` matches the entire rest of the line. `(test)?` is then tried at the end of the line, where there is no `test` left to match, and since it's optional, it succeeds without matching anything.

One way to prevent `.*` from matching the rest of the string before the next part gets a chance is to make it non-greedy by appending the `?` quantifier modifier (not to be confused with the `?` quantifier, which means zero-or-one). But that doesn't help here: `.*?` will then just match the empty string (the shortest string it can match), and `(test)?` will still match the empty string afterward, since `this` is not immediately followed by `test`.

Depending on what you are trying to do, there are a couple of possible solutions. The first is to make the `(test)` group non-optional by removing the `?`, which makes the regex try shorter and shorter matches for `.*` until the following text matches `(test)` (a regex feature known as backtracking). Another option is to anchor the match to the end of the string with `$` after a non-greedy `.*?`. The `.*?` then keeps growing until `(test)?$` can match, so `test` is captured if it is at the end of the string, and `(test)?` falls back to matching the empty string only if it is not (a sort of reverse backtracking).

/(this).*(test)/
/(this).*?(test)?$/

As a side note, your replacement variables should be `$1` and `$2`, not `\1` and `\2`. Backslash references are for use within the regex itself; using them in the replacement is only supported because it's a feature of sed.

Grinnz
  • Thank you. How would you suggest I rewrite the regex? I can't seem to remove the parts in between 'this' and 'test'. – user121392 Sep 17 '19 at 21:40
  • @user121392 It really depends on what your specific inputs are and what you're trying to do. The options I presented in the answer are as much as I could suggest generically. – Grinnz Sep 17 '19 at 21:42
  • the generic way to match any (non-newline) character up to but not including an optional "test" would be `(?:(?!test).)*` (so in full, `/(this)(?:(?!test).)*(test)?/`) – ysth Sep 17 '19 at 21:51
  • It is a string with an optional match in the middle ('test'), so anchoring to the end would not work. I am looking to extract specific parts of a string which have optional matches in the middle. – user121392 Sep 17 '19 at 21:51
  • @ysth I think your final regex is missing an opening parenthesis. I can't get it to work. – user121392 Sep 17 '19 at 21:56
  • it was missing one briefly, I added it; try again? – ysth Sep 17 '19 at 21:56
  • @ysth Thank you. It worked. While I am reading up on non-capturing groups, why did you add (?!test) instead of (!test) and the . behind (?!test) ? – user121392 Sep 17 '19 at 22:08
  • @user121392 `(?!` is a negative lookahead, which means that group is equivalent to `.*` but in addition it fails if the four characters after that point are `test`. Basically this just makes `.*` only match the characters up until that happens, after which it backtracks and the following `(test)?` can match. – Grinnz Sep 17 '19 at 22:20
  • see https://stackoverflow.com/questions/23403494/perl-matching-string-not-containing-pattern – ysth Sep 17 '19 at 22:23
-1

Since you're using Perl, this is a good way to do it.

Use this if you don't want to use any anchors.
Maybe you are not in a multi-line environment.
Btw, anchors are a crutch; avoid using them if possible.
It will expand your mind.

(this)(?|.*(test)|.*())

https://regex101.com/r/1p4FVK/1

 ( this )                      # (1)
 (?|                           # Branch reset, reuse grp 2
      .* 
      ( test )                      # (2)
   |  
      .*  
      ( )                           # (2)
 )

With the branch reset, replace with `$1 $2`.
Without the branch reset it's `(this)(?:.*(test)|.*())`, and you replace with `$1 $2$3`.