
I have a PowerShell script whose main purpose is to search through the HTML files in a folder, find specific HTML markup, and replace it with whatever I specify.

I have been able to do 3/4 of my find and replaces perfectly. The one I am having trouble with involves a Regular Expression.

This is the markup that I am trying to make my regex find and replace:

<a href="programsactivities_skating.html"><br />
                                           </a>

Here is the regex I have so far, along with the function I am using it in:

automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>)' -replace ''

And here is the automate function:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = Get-Content $file
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}

I have been trying to figure out the solution to this for about 2 days now, and just can't seem to get it to work. I have determined that the problem is that I need to tell my regex to account for multiline matches, and that's what I'm having trouble with.

Any help anyone can provide is greatly appreciated.

Thanks in Advance.

Matt Bettiol

3 Answers


Get-Content produces an array of strings, where each string contains a single line from your input file, so you won't be able to match text passages spanning more than one line. You need to merge the array into a single string if you want to be able to match more than one line:

$text = Get-Content $file | Out-String

or

[String]$text = Get-Content $file

or

$text = [IO.File]::ReadAllText($file)

Note that the first two methods don't preserve the line breaks from the input file. Method 2 replaces every line break with a space (the array is joined with the default output field separator), as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.
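
For illustration, here is a minimal sketch of the question's automate function changed to read each file as a single string. It swaps Out-File for [IO.File]::WriteAllText so that no extra trailing newline is appended on each run; that swap is my own choice, not something the question asked for. The temporary folder, the file name page.html, and the simplified pattern in the demo are made up for the example.

```powershell
# Sketch of the question's function, reading each file as one string.
function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in $processFiles) {
        # ReadAllText returns the whole file as a single string, line breaks included.
        $text = [IO.File]::ReadAllText($file.FullName)
        $text = $text -replace $query, $replace
        # WriteAllText writes the string as-is; [Text.Encoding]::UTF8 emits a byte order mark.
        [IO.File]::WriteAllText($file.FullName, $text, [Text.Encoding]::UTF8)
    }
}

# Hypothetical demo in a temporary folder (folder and file names are made up).
$demoDir = Join-Path ([IO.Path]::GetTempPath()) ([IO.Path]::GetRandomFileName())
New-Item -ItemType Directory -Path $demoDir | Out-Null
$demoFile = Join-Path $demoDir 'page.html'
[IO.File]::WriteAllText($demoFile, "<p>keep</p><a href=`"x.html`"><br />`r`n      </a><p>end</p>")

# Simplified version of the question's pattern (the exclude placeholders are left out).
automate -school $demoDir -query '(?is)<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>' -replace ''
$result = [IO.File]::ReadAllText($demoFile)
$result
```

With the file read as one string, the empty anchor that spans two lines should be removed while the surrounding markup stays in place.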

Ansgar Wiechers
    Or if you're on V3 or greater `$text = Get-Content $file -raw`. BTW be careful with that last example as it does NOT preserve line breaks. – Keith Hill Feb 20 '14 at 18:36

I don't get what it is you're trying to do with those Exclude elements, but I find multi-line regex is usually easier to construct in a here-string:

$text = @'
<a href="programsactivities_skating.html"><br />
                                       </a>
'@

$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@

$text -match $regex

True
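
One caveat, as far as I know: the line break inside a here-string is whatever the script file (or console) uses, so a pattern built this way can fail against text that uses a different line-ending convention. A minimal sketch that sidesteps this, assuming you only care that whitespace separates the two tags, lets \s+ span the break instead. The two sample strings are made-up test data.

```powershell
# Sample text with a Windows (CRLF) and a Unix (LF) line break.
$crlf = "<a href=`"programsactivities_skating.html`"><br />`r`n      </a>"
$lf   = "<a href=`"programsactivities_skating.html`"><br />`n      </a>"

# \s+ matches CR, LF, and spaces, so the pattern needs no literal line break.
$regex = '(?i)<a href="programsactivities_skating\.html"><br />\s+</a>'

$crlfMatches = $crlf -match $regex
$lfMatches   = $lf -match $regex
$crlfMatches, $lfMatches
```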
mjolinor

Get-Content returns an array of strings, so you need to concatenate those strings into a single one:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = ""
        # Append each line plus CRLF; don't assign the pipeline result, which would overwrite $text with $null.
        Get-Content $file | % { $text += $_ + "`r`n" }
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}
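
To see why the joining step matters, here is a small sketch with made-up lines standing in for Get-Content output. It uses -join rather than a loop, which is just one way to build the string, and a simplified version of the question's pattern.

```powershell
# Made-up lines standing in for what Get-Content returns: one string per line.
$lines = '<a href="x.html"><br />', '      </a>'
$query = '(?is)<a[^>]*?>(\s|<br\s?/?>)*</a>'

# Against the array, -replace runs on each line separately, so nothing matches.
$perLine = $lines -replace $query, ''

# Joined into one string, the pattern can span the line break.
$joined = ($lines -join "`r`n") -replace $query, ''
```

Here $perLine still holds both original lines unchanged, while $joined ends up empty because the whole anchor was matched and removed.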
Raf
  • Why not $text = (Get-Content $file) -join "`r`n" or, as mentioned above, $text = Get-Content $file | Out-String – dwarfsoft Feb 03 '15 at 02:27