We have a directory of 3000+ HTML files that are migrating to a SharePoint site, and we need to scrub some of the data.

Specific situations:

  • About 1/3 of the files include an XML header <?xml version="1.0" encoding="utf-8"?> that SharePoint doesn't like. We plan to just delete that header line.
  • Every file has JavaScript parameters for "HOME" that point to two alternate relative homepage links, foo1.htm and foo.htm. We want to change both to an absolute link of http:\\sharepoint.site\home.aspx
  • Every file also has a JavaScript link parameter "Show" that we just want to hide by changing it to ''.

Here's my function so far:

function scrubXMLHeader {
    $srcfiles     = Get-ChildItem $backupGuidePath -filter "*htm.*"                              
    $srcfilecount = (Get-ChildItem $backupGuidePath).Count                                       
    $selfilecount = $srcfiles.Count                                                              
    # Input and Output Path variables
    $sourcePath        = $backupGuidePath
    $destinationPath   = $workScrubPath
    "Input From: $($sourcePath)" | Log $messageLog -echo
    " Output To: $($destinationPath)" | Log $messageLog -echo
    #
    $temp01 = Get-ChildItem $sourcePath -filter "*.htm"
    foreach($file in $temp01)
    {
        # Join-Path adds the separator if $destinationPath lacks a trailing backslash
        $outfile = Join-Path $destinationPath $file.Name
        $content = Get-Content $file.Fullname | ? {$_ -notmatch "<\?xml[^>]+>" } 
        Set-Content -path $outfile -Force -Value $content
    }
}

I want to add the following two edits to each document:

-replace '("foo.htm", "", ">", "Home", "foo1.htm")', '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'
-replace 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");', ''

I'm not sure how to combine those into a single statement so I open the file, perform the changes, save and close the file instead of three separate open-edit-save/close transactions. I'm also not sure, with all the quotes and commas, the best way to escape these characters, or if the single quotes surrounding the whole string are sufficient.

Understanding that "asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML", but being limited in my toolset to PowerShell, I'm trying to understand the best way to add the two -replace lines to the existing $content variable...separated by commas within the curly braces? piped to each other?

Is the following the best strategy, or is there something better?

$content = Get-Content $file.Fullname | ? {$_ -notmatch "<\?xml[^>]+>", 
    -replace '("foo.htm", "", ">", "Home", "foo1.htm")', '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx"),
    -replace 'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");', '' } 
dwwilson66
  • Every time you [parse HTML with a regular expression](http://stackoverflow.com/a/1732454/1630171), a catgirl dies somewhere. [Proper tools](http://stackoverflow.com/a/20644942/1630171) are at your disposal. Use them. – Ansgar Wiechers Dec 19 '13 at 20:04
  • SAVE THE CATGIRLS! Unfortunately, my tool set extends to powershell v1.0; I'm in a user area and locked down more than I want to be to do my job. I'm sure if IT could find a way to extricate PS from Win7 to prevent my using it, they would. None of the proper tools you mention are available to me because I don't have proper permissions...don't get me started on that. – dwwilson66 Dec 19 '13 at 20:26
  • `Tidy` is optional for prettifying the code. The rest is built into Windows/PowerShell. – Ansgar Wiechers Dec 20 '13 at 12:42

1 Answer

If I'm reading the question correctly, I think this might do what you want:

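# [regex]::Escape makes ( ) . ? match literally instead of acting as regex syntax
$Regex0 = [regex]::Escape('<?xml version="1.0" encoding="utf-8"?>')

$Regex1 = [regex]::Escape('("foo.htm", "", ">", "Home", "foo1.htm")')
$Replace1 =  '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'

$Regex2 = [regex]::Escape('addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");')

foreach($file in $temp01)
    {
        $outfile = Join-Path $destinationPath $file.Name
        # @() forces an array, so -notmatch filters lines even for a one-line file;
        # each -replace then runs on whatever the previous operator returned
        @(Get-Content $file.FullName) -notmatch $Regex0 -replace $Regex1,$Replace1 -replace $Regex2,'' |
         Set-Content -Path $outfile -Force
    }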
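
To see what the chained operators do without touching any files, here is a small sketch that runs the same filter and replacements on an in-memory array. The sample lines and the `setHome` call name are made up for illustration; only the quoted argument lists come from the question.

```powershell
# Hypothetical sample lines standing in for the content of one file.
# 'setHome' is an invented function name; the argument lists are from the question.
$lines = @(
    '<?xml version="1.0" encoding="utf-8"?>',
    'setHome("foo.htm", "", ">", "Home", "foo1.htm");',
    'addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");',
    '<p>body text</p>'
)

# Single quotes stop PowerShell from expanding anything in the string, but the
# regex engine still sees ( ) . ? as special, so escape the literal text.
$xmlHeader = [regex]::Escape('<?xml version="1.0" encoding="utf-8"?>')
$homeArgs  = [regex]::Escape('("foo.htm", "", ">", "Home", "foo1.htm")')
$showBtn   = [regex]::Escape('addButton("show",BTN_TEXT,"Show","","","","",0,0,"","","");')
$newHome   = '("http:\\sharepoint.site\home.aspx", "", ">", "Home", "http:\\sharepoint.site\home.aspx")'

# On an array, -notmatch returns the elements that do NOT match,
# and each -replace is applied to every remaining element in turn.
$result = @($lines) -notmatch $xmlHeader -replace $homeArgs, $newHome -replace $showBtn, ''
$result
```

The header line is dropped entirely, while the "Show" line is only blanked, so it stays in the output as an empty line. Without the escaping, the parentheses in the patterns would be read as regex groups rather than literal characters, and the `addButton(...)` pattern would not match its own text.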
mjolinor
  • In theory, yes, but without the regex. :) The line USING the regex was elegant for me because it just excludes the line that matches the pattern; I was trying to figure out how to add the other two `-replace` lines in with that...can a series of statements be included in the curly braces and separated by commas? Or are the results of each pass piped to the next `-replace`? – dwwilson66 Dec 19 '13 at 20:20
  • updated the script. You can chain match/notmatch and -replace operators, and the filtered/replaced results will be passed on to the next operator, so you don't need the pipeline in between. – mjolinor Dec 19 '13 at 20:24
  • aha...that makes sense. thanks. I also updated my question to make the specifics clearer, and make it evident that I was not trying to kill catgirls with regex. :) – dwwilson66 Dec 19 '13 at 20:30
  • You might kill them anyway. They've got nine lives, but you've got a foreach loop, so it'll come down to how many files you've got. – mjolinor Dec 19 '13 at 21:54