We have a directory of 3000+ HTML files that are being migrated to a SharePoint site, and we need to scrub some of the data.
Specific situations:
- About 1/3 of the files include an XML header
<?xml version="1.0" encoding="utf-8"?>
that SharePoint doesn't like. We plan to just delete that header line.
- Every file has JavaScript parameters for "HOME" that point to two alternate relative homepage links, foo1.htm or foo.htm. We want to change both to an absolute link of http:\\sharepoint.site\home.aspx
- Every file also has a JavaScript link parameter "Show" that we just want to hide by changing it to ''.
Here's my function so far:
function scrubXMLHeader {
    # File counts: HTML files vs. everything in the backup folder
    $srcfiles = Get-ChildItem $backupGuidePath -Filter "*.htm*"
    $srcfilecount = (Get-ChildItem $backupGuidePath).Count
    $selfilecount = $srcfiles.Count
    # Input and Output Path variables
    $sourcePath = $backupGuidePath
    $destinationPath = $workScrubPath
    "Input From: $($sourcePath)" | Log $messageLog -echo
    " Output To: $($destinationPath)" | Log $messageLog -echo
    #
    $temp01 = Get-ChildItem $sourcePath -Filter "*.htm"
    foreach ($file in $temp01)
    {
        # Build the output path from the file name only, not the whole FileInfo object
        $outfile = Join-Path $destinationPath $file.Name
        # Drop any line that contains the XML header
        $content = Get-Content $file.FullName | ? { $_ -notmatch "<\?xml[^>]+>" }
        Set-Content -Path $outfile -Force -Value $content
    }
}
I want to add the following two edits to each document:
-replace '("foo.htm", "", ">", "Home", "foo1.htm")', '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'
-replace 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");', ''
I'm not sure how to combine those into a single statement so that I open the file, perform all the changes, and save and close it once, instead of running three separate open-edit-save/close transactions. I'm also not sure of the best way to escape all the quotes and commas, or whether the single quotes surrounding the whole string are sufficient.
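To make the escaping concern concrete, here is a minimal sketch, not a confirmed fix. It relies on two facts about PowerShell: single-quoted strings keep the embedded double quotes and commas literal, and -replace treats its first operand as a regular expression, so the parentheses in the search text need escaping. [regex]::Escape does that automatically, and the .NET String.Replace method is a plain-text alternative (note that it is case-sensitive, while -replace is not). The sample $line is made up for illustration.

```powershell
# Literal text to remove, copied from the question.
# Single quotes keep the double quotes and commas literal.
$showLiteral = 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");'

# -replace expects a regex pattern, so escape the parentheses and other metacharacters.
$showPattern = [regex]::Escape($showLiteral)

# Hypothetical sample line, for illustration only.
$line = 'x; addButton("show",BTN_TEXT,"Show","","","","",0,0,"","",""); y;'

# Option 1: regex-based -replace with the escaped pattern.
$viaRegex = $line -replace $showPattern, ''

# Option 2: plain-text String.Replace, no escaping needed (case-sensitive).
$viaString = $line.Replace($showLiteral, '')
```

Both options should leave the same text behind on this sample; which one fits better depends on whether case-insensitive matching is wanted.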
Understanding that "asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML", but being limited in my toolset to PowerShell, I'm trying to understand the best way to add the two -replace lines to the existing $content variable... separated by commas within the curly braces? Piped to each other?
Is the following the best strategy, or is there something better?
$content = Get-Content $file.Fullname | ? {$_ -notmatch "<\?xml[^>]+>",
-replace '("foo.htm", "", ">", "Home", "foo1.htm")', '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")',
-replace 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");', '' }
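For comparison, one possible shape for a single read-edit-write pass is sketched below, under stated assumptions. The Where-Object filter only decides which lines to keep, so the replacements are applied in a separate ForEach-Object stage, where -replace operators can be chained one after another. The function name Convert-GuideLine, the addTab(...) wrapper, and the sample lines are hypothetical; the search and replacement strings are copied from the question.

```powershell
# Search and replacement text from the question (single-quoted, so taken literally).
$homeOld = '("foo.htm", "", ">", "Home", "foo1.htm")'
$homeNew = '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'
$showOld = 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");'

# Escape the search text once, because -replace expects a regex pattern.
$homePattern = [regex]::Escape($homeOld)
$showPattern = [regex]::Escape($showOld)

# Hypothetical helper: filter out the XML header, then apply both replacements per line.
function Convert-GuideLine {
    param([string[]]$Line)
    $Line |
        Where-Object { $_ -notmatch '<\?xml[^>]+>' } |
        ForEach-Object { $_ -replace $homePattern, $homeNew -replace $showPattern, '' }
}

# Made-up sample content standing in for one file's lines.
$sample = @(
    '<?xml version="1.0" encoding="utf-8"?>',
    'addTab("foo.htm", "", ">", "Home", "foo1.htm");',
    'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");'
)
$result = Convert-GuideLine $sample
```

Inside the existing foreach loop, this would presumably be used as $content = Convert-GuideLine (Get-Content $file.FullName), followed by the same Set-Content call, so each file is read and written once. Note that the "Show" line is left as an empty line rather than removed entirely.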