
The following regex:

(?<=href(\s+)?=(\s+)?")(?!(\s+)?http)(?!//).+(?=")

works as expected with these test strings:

href="//www.google-analytics.com/analytics.js">
href="https://www.google-analytics.com/analytics.js">
href="index.html">
href="..\index.html">
href="main.css">
href="..\assets\main.css">
href = " ..\assets\main.css ">

As you may see here: https://t.co/PC0U9br3vn

However:

[string] $string = Get-Content sample.txt

[string] $regex = '(?<=href(\s+)?=(\s+)?")(?!(\s+)?http)(?!(\s+)?//)(?!(\s+)?mailto).+(?=")'

$newString = $string -replace $regex, "..\$&"

$string
$newString

Produces the following output:

//www.google-analytics.com/analytics.js">  href=" https://www.google-analytics.com/analytics.js">  href="index.html">  href="..\index.html">  href="  main.css">  href="..\assets\main.css">  href = " ..\assets\main.css ">  href = "mailto://email@domain ">  href = "..\..\..\assets\main.css"
//www.google-analytics.com/analytics.js">  href=" https://www.google-analytics.com/analytics.js">  href="..\index.html">  href="..\index.html">  href="  main.css">  href="..\assets\main.css">  href = " ..\assets\main.css ">  href = "mailto://email@domain ">  href = "..\..\..\assets\main.css"

Only the first matching entry (index.html) is being operated on.

The same script works elsewhere, where the replacement is a simple string rather than a regex substitution.

Craig.C
  • You should never use `.*` or `.+` and other variations if you may need multiple matches as those greedy constructs eat up too many characters (up to the end of line/string). Use negated character class to limit matching to just inside the double quotes: change `.+(?=")` to `[^"]+`. – Wiktor Stribiżew Nov 22 '15 at 11:53
  • Thanks @stribizhev this is very helpful. I suspected this was not the best way. However I'm just starting out with regex and I thought I would have to write a complex inclusion group e.g. [a-zA-Z0-9&:?] etc. etc. and I lost heart. Much better with this concise exclusion set. Perhaps repeated use of (\s+)? could be replaced with a general ignore white-space parameter? – Craig.C Nov 23 '15 at 18:36
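To illustrate the point about greedy quantifiers made in the comment above, here is a minimal sketch. The sample line and the simplified lookbehind are assumptions chosen for illustration, not the asker's actual data; it only compares how many matches each pattern finds when two targets sit on one line.

```powershell
# Hypothetical single line containing two href targets
$line = 'href="index.html"> href="main.css">'

# Greedy: .+ runs to the end of the line, then backs off to the LAST quote
$greedy  = [regex]::Matches($line, '(?<=href=").+(?=")')

# Negated character class: stops at the first closing quote
$negated = [regex]::Matches($line, '(?<=href=")[^"]+')

$greedy.Count                             # 1 (a single match spanning both attributes)
$negated.Count                            # 2 (one match per attribute)
$negated | ForEach-Object { $_.Value }    # index.html, then main.css
```

The greedy version swallows everything between the first opening quote and the last closing quote, so a replace touches only the first target.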

2 Answers

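The input is of the wrong type. Get-Content returns one string per line, and casting that to a single string joins all the lines into one:

[string] $string = Get-Content sample.txt

However, an array of strings works:

[string[]] $string = Get-Content sample.txt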
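A small sketch of why the type matters, assuming an in-memory array as a stand-in for what Get-Content would return (the two sample lines and the variable names are hypothetical). It uses the question's greedy pattern in simplified form, so the difference comes from the type alone.

```powershell
# Hypothetical stand-in for the lines Get-Content would return
$lines = 'href="index.html">', 'href="main.css">'

[string]   $joined   = $lines   # flattened to ONE string, elements joined by a space
[string[]] $separate = $lines   # stays an array, one element per line

$regex = '(?<=href\s*=\s*")(?!\s*http)(?!//).+(?=")'

# On the joined string the greedy .+ spans both targets, so only the first is prefixed
$fromJoined   = $joined   -replace $regex, '..\$&'

# On an array, -replace is applied to each element separately
$fromSeparate = $separate -replace $regex, '..\$&'

$fromJoined      # href="..\index.html"> href="main.css">
$fromSeparate    # href="..\index.html">  and  href="..\main.css">
```

With the array, each line is matched on its own, so the greedy `.+` cannot run past the end of a line into the next target.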
Damian Kozlak
Craig.C

All you need is a negated character class [^"]+ (see this post of mine where I explain how [^"]+ works). Also note that (\s+)? matches the same text as \s*. There is no need to overstuff your regex with capturing groups if you are not planning to use them.

Use

(?<=href\s*=\s*")(?!\s*http)(?!//)[^"]+

See regex demo

Here is what it matches:

  • (?<=href\s*=\s*") - if the current position is preceded by href, 0 or more whitespace symbols, =, 0 or more whitespace symbols again, and a ", and...
  • (?!\s*http) - if the current position is not followed by 0 or more whitespace symbols and then http, and...
  • (?!//) - if there is no // right after the current position...
  • [^"]+ - match 1 or more characters other than ".
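As a rough check of the breakdown above, the following sketch applies the pattern with -replace to one unbroken line. The sample line is an assumption assembled from the question's test strings, and the `..\$&` replacement is taken from the question's script.

```powershell
$regex = '(?<=href\s*=\s*")(?!\s*http)(?!//)[^"]+'

# Hypothetical line with four targets: two external, two local
$line = 'href="//www.google-analytics.com/analytics.js"> href="https://www.google-analytics.com/analytics.js"> href="index.html"> href = "main.css">'

# $& in the replacement stands for the whole match; the backslash is literal
$result = $line -replace $regex, '..\$&'

$result
```

Only the two local targets (index.html and main.css) gain the `..\` prefix; the protocol-relative and https links are skipped by the two negative lookaheads.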
Wiktor Stribiżew
  • Lovely detailed info. Particularly in the link. Thanks so much @strib – Craig.C Nov 24 '15 at 08:23
  • It was certainly very helpful and has improved this ugly regex. With your suggestion it still works when there are many targets on an unbroken line, rather than selecting everything up until the final ". This would have been my next Stack Overflow question, as I had just noticed this unwanted side effect of the .+ statement. Therefore I upvoted. However, the primary question was why my PS script was only operating on the first instance (it was doing so correctly, as there was only one target per line), and I believe this was because I was using the wrong object type. Thank you for being so hel – Craig.C Nov 25 '15 at 08:09
  • True. Glad you found out how to fix that. – Wiktor Stribiżew Nov 25 '15 at 08:11