
In PowerShell (5.1 or 7), I run:

PS R:\> "abcdef" -replace '.*','x'
xx
PS R:\> "abcdef" -replace '.+','x'
x
PS R:\> "abcdef" -replace '^.*','x'
x
PS R:\> "abcdef" -replace '^.+','x'
x
PS R:\>
PS R:\> "abcdef" -replace '^','x'
xabcdef
PS R:\>

As you can see, the first run returned xx, but I was expecting a single x. I tried the same patterns with sed in bash (executables from gitdir/usr/bin; msys, I think) and got what I expected.

2021-05-01 01:34:27 /r :
$ echo "abcdef" | sed -E s/.*/x/g
x

2021-05-01 01:35:03 /r :
$ echo "abcdef" | sed -E s/.+/x/g
x

2021-05-01 01:35:08 /r :
$ echo "abcdef" | sed -E s/^.*/x/g
x

2021-05-01 01:35:17 /r :
$ echo "abcdef" | sed -E s/^.+/x/g
x

2021-05-01 01:35:20 /r :
$ echo "abcdef" | sed -E s/^/x/g
xabcdef

2021-05-01 01:35:25 /r :
$

I have read the documentation and can't figure out what is happening.

miwelus
  • Seems this is a regex behavior, not powershell specifically. 2 matches are returned. I cannot explain it though. https://regex101.com/r/TE7TcT/1 – Daniel May 01 '21 at 08:52
  • Perhaps because the very first match is the _zero match_ and the rest is the _or more_ match (in regex the asterisk means _zero or more matches_)? – Theo May 01 '21 at 09:55
  • @Theo Nope, it's the other way around - first match is `abcdef`, second is the empty string between `f` and the end of the string – Mathias R. Jessen May 01 '21 at 10:10
  • @MathiasR.Jessen Good to know! I came up with that by anchoring to the end: `"abcdef" -replace '.*$','x'` --> `xx`, while anchoring to the beginning of the string, `"abcdef" -replace '^.*','x'`, returned the single `x` – Theo May 01 '21 at 10:15
  • This is what RegexBuddy makes of it https://i.stack.imgur.com/BZE5p.png - the same warning is shown when selecting `.NET` so looks like a general `.NET` thing – Martin Smith May 01 '21 at 10:49
  • Good question; I hope the linked duplicate sheds some more light on the _why_. – mklement0 May 01 '21 at 13:31

2 Answers


Let's find out!

The easiest way to find out what exactly was matched by a regex pattern in any version of PowerShell is by using Regex.Matches():

PS ~> [regex]::Matches('abcdef', '.*')
    
Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 0
Length   : 6
Value    : abcdef

Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 6
Length   : 0
Value    :

Aha! It's matching the substring abcdef, and then the empty string between f and the end of the string.
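
One way to picture where the second match comes from is to walk through the string by hand. The loop below is only a simplified model of a "find all matches" search, not the actual .NET implementation; the variable names are made up for illustration. After `abcdef` is consumed, the search resumes at offset 6, where `.*` is still allowed to match zero characters:

```powershell
# Simplified model of a global match loop (an assumption, not the real .NET source).
$regex = [regex]'.*'
$text  = 'abcdef'
$pos   = 0
$found = @()

while ($pos -le $text.Length) {
    # Look for the next match, starting the search at offset $pos.
    $m = $regex.Match($text, $pos)
    if (-not $m.Success) { break }
    $found += $m

    # Resume right after the match; after an empty match, step one
    # position further so the loop cannot get stuck at the same offset.
    $pos = $m.Index + $m.Length
    if ($m.Length -eq 0) { $pos++ }
}

$found | ForEach-Object { "Index $($_.Index), Length $($_.Length), Value '$($_.Value)'" }
```

With `'^.*'` instead, the search that resumes at offset 6 finds nothing, because `^` only matches at the start of the string, which is consistent with the single `x` seen in the question.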


In PowerShell 7 we can also use a scriptblock with the replace operator to confirm:

PS ~> "abcdef" -replace '.*',{"['$($_.Value)' (length $($_.Length)) starting at $($_.Index)]"}
['abcdef' (length 6) starting at 0]['' (length 0) starting at 6]

I'm afraid I don't know why the regex engine implementors decided that this behavior was preferable to the behavior of sed, but at least we now know what happens.

Mathias R. Jessen
  • Nice demonstration; as for the _why_: long discussion [here](https://stackoverflow.com/q/52369618/45375), but it still doesn't fully make sense to me. – mklement0 May 01 '21 at 13:30
  • @mklement0 There's a comment there about "posix leftmost longest match" for sed and awk. Regex101.com shows 2 matches. – js2010 May 01 '21 at 14:22
  • @mklement0 interesting observations. Having not really given it much thought previously, my initial instinct was actually "sed is being weird and 'friendly', .NET is acting how I would expect", along the same lines you point out halfway through (i.e. "position N is a perfectly valid offset for matching") – Mathias R. Jessen May 01 '21 at 15:04

Select-String also shows 2 matches:

# select-string highlights matches in ps 7, but you can't see the 2nd match anyway
'abcdef' | select-string .* -AllMatches | % matches   # 2 matches

This looks like a .NET thing, even in PowerShell 7. regex101.com/r/VzxbOT/1 gives 2 matches as well, so maybe it's sed that's wrong, since /g means global, i.e. all matches? ("POSIX leftmost longest match"? Should .NET follow that standard?)

[regex]::Replace('abcdef','.*','x')

xx

To replace only once (see "Replacing only the first occurrence of a word in a string"):

$pattern = [regex]'.*'
$pattern.replace('abcdef','x',1)

x
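
For comparison, here is a small sketch that puts the single-replacement variants from this page side by side. The variable names are only illustrative, and the empty-string cases are an addition that is not part of the original question; they show that the variants are not interchangeable for every input:

```powershell
$text = 'abcdef'

$anchored = $text -replace '^.*', 'x'              # anchor to the start of the string
$nonEmpty = $text -replace '.+', 'x'               # require at least one character
$once     = ([regex]'.*').Replace($text, 'x', 1)   # cap the number of replacements at 1

# All three give 'x' for 'abcdef'. They differ on an empty input:
$emptyAnchored = '' -replace '^.*', 'x'   # 'x': one empty match at offset 0
$emptyNonEmpty = '' -replace '.+', 'x'    # '' : no match, input returned unchanged
```

Which variant fits depends on whether an empty input should still be replaced.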

Search and replace in awk on macOS works the same as sed. At first it only worked in bash; in PowerShell you have to backslash-escape the required double quotes:

echo 'abcdef' | awk '{ gsub(/.*/,\"x\"); print }'

x
js2010
  • Yes, `sed` and `awk` (both the BSD/macOS and the GNU/Linux implementations) globally match `.*` only _once_; ditto for `mawk`. Python 2.x and Python _up to v3.6_ match only once also, but from what I can tell the majority of engines match _twice_. – mklement0 May 01 '21 at 15:17
  • An aside re the unfortunate need to `\ `-escape the `"` chars: PowerShell Core 7.2.0-preview.5 introduced experimental feature `PSNativeCommandArgumentPassing`, which makes this no longer necessary; it works robustly on Unix platforms, but on Windows important accommodations are missing; plus, there are currently bugs - see [GitHub issue #15143](https://github.com/PowerShell/PowerShell/issues/15143). – mklement0 May 01 '21 at 15:20
  • Nice, thanks for the regex101 link with explanations. – miwelus May 01 '21 at 19:06