I have a file with the content below, and I am trying to remove everything from <!-- to -->:

<!--<br>
/* Font Definitions */

-->
Only keep this part 
Andy Arismendi
Luigi
  • Hi, I am new to this site. I am using PowerShell; I didn't mean to put the r tag there. Basically, I am trying to remove HTML from a file. – Luigi Aug 14 '14 at 02:59
  • I've updated the tag so the correct people see this. – MrFlick Aug 14 '14 at 03:01
  • Don't use a regex. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Aaron Jensen Aug 25 '14 at 16:21

2 Answers

Don't use a regex. HTML isn't a regular language, so it can't be properly parsed with a regex. It will succeed most of the time, but other times will fail. Spectacularly.

I recommend opening the file and reading it a character at a time, looking for the characters <, !, - followed by -. Then continue reading until you find -, - followed by >.

$chars = [IO.File]::ReadAllText( $path ).ToCharArray()
$newFileContent = New-Object 'Text.StringBuilder'
$inComment = $false
for( $i = 0; $i -lt $chars.Length; ++$i )
{
    if( $inComment )
    {
        # Inside a comment: skip characters until the closing -->
        if( $chars[$i] -eq '-' -and $chars[$i+1] -eq '-' -and $chars[$i+2] -eq '>' )
        {
            $inComment = $false
            # Step onto the '>'; the loop's ++$i then moves past it.
            $i += 2
        }
        continue
    }

    if( $chars[$i] -eq '<' -and $chars[$i+1] -eq '!' -and $chars[$i+2] -eq '-' -and $chars[$i+3] -eq '-' )
    {
        $inComment = $true
        # Step onto the last '-' of the opener; the loop's ++$i then moves past it.
        $i += 3
        continue
    }

    # [void] keeps Append's return value out of the output stream.
    [void] $newFileContent.Append( $chars[$i] )
}
$newFileContent.ToString() | Set-Content -Path $path
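
The same scan can also be written with String.IndexOf instead of a per-character loop. This is only a sketch of that variation, not part of the original answer: the function name Remove-HtmlComment and its -Text parameter are made up for illustration, and it assumes an unterminated comment should swallow the rest of the text.

```powershell
function Remove-HtmlComment {
    param( [string] $Text )

    $result = New-Object 'Text.StringBuilder'
    $pos = 0
    while( $pos -lt $Text.Length )
    {
        # Find the next comment opener at or after the current position.
        $start = $Text.IndexOf( '<!--', $pos, [StringComparison]::Ordinal )
        if( $start -lt 0 )
        {
            # No more comments: keep the rest of the text.
            [void] $result.Append( $Text.Substring( $pos ) )
            break
        }

        # Keep the text between the previous comment and this one.
        [void] $result.Append( $Text.Substring( $pos, $start - $pos ) )

        # Look for the closer after the 4-character opener.
        $end = $Text.IndexOf( '-->', $start + 4, [StringComparison]::Ordinal )
        if( $end -lt 0 )
        {
            # Assumption: an unterminated comment drops everything after its opener.
            break
        }
        $pos = $end + 3
    }
    $result.ToString()
}

# Usage with the sample from the question.
$sample = "<!--<br>`n/* Font Definitions */`n`n-->`nOnly keep this part"
$cleanedSample = Remove-HtmlComment -Text $sample
```

Like the loop above, this only looks for the literal <!-- and --> markers; it does not know about scripts, attributes, or anything else in the markup.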
Aaron Jensen

Regular expressions to the rescue again:

@'
<!--<br>
/* Font Definitions */

-->
Only keep this part 
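'@ -replace '(?s)<!--.*?-->', ''

(?s) makes the dot match newlines too, so the pattern can span several lines. The lazy .*? stops at the first --> and, unlike .+?, also matches an empty comment such as <!---->.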
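
Since the question is about a file rather than a literal string, here is a rough sketch of applying the same replacement to a file's contents. The demo path and file name are hypothetical, and the pattern uses .*? so that empty comments are matched as well. [IO.File] is used for both reading and writing so that no extra trailing newline is added; note that it needs a full path, because .NET does not follow PowerShell's current location.

```powershell
# Hypothetical demo file in the temp directory; substitute your own full path.
$path = Join-Path ([IO.Path]::GetTempPath()) 'comment-demo.html'
[IO.File]::WriteAllText( $path, "<!--<br>`n/* Font Definitions */`n`n-->`nOnly keep this part" )

# Read the whole file as one string so the pattern can span lines.
$text = [IO.File]::ReadAllText( $path )

# (?s) lets . match newlines; .*? is lazy, so each comment is removed separately.
$cleaned = $text -replace '(?s)<!--.*?-->', ''

[IO.File]::WriteAllText( $path, $cleaned )
```

On PowerShell 3.0 and later, Get-Content -Raw should work in place of ReadAllText; without -Raw, Get-Content returns an array of lines and the multi-line match will not behave as intended.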

Andy Arismendi