1

I'm trying to delete the lines containing the "unwanted" class from an HTML file using a PowerShell script:

<a class="unwanted" href="http://www.mywebsite.com/rest/of/url1" target="_blank">my_file_name1</a><br>
<a class="mylink" href="http://www.mywebsite.com/rest/of/url2" target="_blank">my_file_name2</a><br>
<a class="unwanted" href="http://www.mywebsite.com/rest/of/url3" target="_blank">my_file_name3</a><br>

Currently I'm replacing strings using this script:

$s = "old string"
$r = "new string"

Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
  (Get-Content $_.FullName) `
    | % { $_ -replace [regex]::Escape($s), $r } `
    | Set-Content $_.FullName
}
Ruben Bartelink
M. A.

3 Answers

2

Since your question is also tagged for cmd.exe and batch scripting, I want to contribute a related answer.

cmd.exe/batch scripting does not understand the HTML file format. However, if your HTML files look like the sample data you provided (each <a> tag and its closing </a> tag are on a single line, with nothing else on that line except <br>), the following command line could work for you. It assumes the HTML file to process is called classes.html and that the modified data is to be written to classes_new.html:

> "classes_new.html" findstr /V /I /L /C:"class=\"unwanted\"" "classes.html"

This only works if the string class="unwanted" occurs only in the <a> tags that need to be removed.


To process multiple files, the following batch script could be used, based on the above command line:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

set "ARGS=%*"
setlocal EnableDelayedExpansion
for %%H in (!ARGS!) do (
    endlocal
    call :SUB "%%~H"
    setlocal
)
endlocal

endlocal
exit /B

:SUB file
if /I not "%~x1"==".html" if /I not "%~x1"==".htm" exit /B 1
findstr /V /I /L /C:"class=\"unwanted\"" "%~f1" | (> "%~f1" find /V "")
exit /B

The actual removal of lines is done in the sub-routine :SUB, which exits early if the file name extension is anything other than .html or .htm. The main script loops through all the given command line arguments and calls :SUB for every single file. Note that this script does not create new files for the modified HTML contents; it overwrites the given HTML files.

aschipfl
1

Removing lines is even easier than replacing them. When outputting to Set-Content, simply omit the lines that you want removed. You can do this with Where-Object in place of your ForEach-Object (%).

Adapting your example:

$s = "unwanted regex"

Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
  (Get-Content $_.FullName) `
    | where { $_ -notmatch $s } `
    | Set-Content $_.FullName
}
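
Before pointing this at real files, the filter can be tried on an in-memory array. This is a minimal sketch, assuming the three sample lines from the question; the pattern class="unwanted" and the variable names $lines and $kept are my own choices for illustration, not part of the original script.

```powershell
# Sample lines copied from the question (held in memory; no files are touched).
$lines = @(
  '<a class="unwanted" href="http://www.mywebsite.com/rest/of/url1" target="_blank">my_file_name1</a><br>'
  '<a class="mylink" href="http://www.mywebsite.com/rest/of/url2" target="_blank">my_file_name2</a><br>'
  '<a class="unwanted" href="http://www.mywebsite.com/rest/of/url3" target="_blank">my_file_name3</a><br>'
)

# Assumed pattern for this sample; [regex]::Escape makes -notmatch treat it literally.
$s = [regex]::Escape('class="unwanted"')

# Keep only the lines that do not match; @() keeps the result an array
# even when a single line survives.
$kept = @($lines | Where-Object { $_ -notmatch $s })
$kept
```

Only the "mylink" line should remain. Note that -notmatch is case-insensitive by default; -cnotmatch is the case-sensitive variant.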

If you want literal matching instead of regex, substitute the where clause with:

where { -not $_.Contains($s) } `

Note that this uses the .NET method String.Contains(), not the PowerShell operator -contains; the latter tests collection membership rather than searching for a substring.
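
The difference is easy to check interactively. The following is a small sketch with a made-up sample line (the $line value is hypothetical, chosen only for illustration):

```powershell
# Hypothetical sample line and search string, for illustration only.
$line = '<a class="unwanted" href="#">x</a>'
$s = 'class="unwanted"'

# .Contains() is a case-sensitive substring test on the string itself.
$substring = $line.Contains($s)          # True

# -contains asks whether a collection has an element equal to the right operand.
# A single string is treated as a one-element collection, so this is an
# equality check against the whole line, not a substring search.
$operator = $line -contains $s           # False

# With a real collection, -contains behaves as intended.
$inArray = @('a', $s) -contains $s       # True

$substring, $operator, $inArray
```

Unlike -notmatch, .Contains() is case-sensitive, so CLASS="UNWANTED" would not be caught by the literal variant.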

Ryan Bemrose
-1

Try using multiline strings (here-strings) for your $s and $r. I tested this with the HTML examples you posted as well, and it worked fine.

$s = @"
old string
"@
$r = @"
new string
"@

Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
  (Get-Content $_.FullName) `
    | % { $_ -replace $s, $r } `
    | Set-Content $_.FullName
}
xXhRQ8sD2L7Z