I've got the following PowerShell script to strip HTML from an HTML-formatted email. However, it does not strip what appear to be the stylesheets. I'm hoping someone more knowledgeable in this area can suggest how to fix that:

$html = @'
'@

# remove line breaks, replace with spaces
#$html = $html -replace "(`r|`n|`t)", " "

# remove invisible content
@('head', 'style', 'script', 'object', 'embed', 'applet', 'noframes', 'noscript', 'noembed') | % {
 $html = $html -replace "<$_[^>]*?>.*?</$_>", ""
}

# Condense extra whitespace
$html = $html -replace "( )+", " "

# Add line breaks
@('div','p','blockquote','h[1-9]') | % { $html = $html -replace "</?$_[^>]*?>.*?</$_>", ("`n" + '$0' )} 

# Add line breaks for self-closing tags
@('div','p','blockquote','h[1-9]','br') | % { $html = $html -replace "<$_[^>]*?/>", ('$0' + "`n")}

#strip tags 
$html = $html -replace "<[^>]*?>", ""

# write-verbose "removed tags: `n`n$html`n"
  
# replace common entities
@( 
@("&bull;", " * "),
@("&lsaquo;", "<"),
@("&rsaquo;", ">"),
@("&(rsquo|lsquo);", "'"),
@("&(quot|ldquo|rdquo);", '"'),
@("&trade;", "(tm)"),
@("&frasl;", "/"),
@("&(quot|#34|#034|#x22);", '"'),
@('&(amp|#38|#038|#x26);', "&"),
@("&(lt|#60|#060|#x3c);", "<"),
@("&(gt|#62|#062|#x3e);", ">"),
@('&(copy|#169);', "(c)"),
@("&(reg|#174);", "(r)"),
@("&nbsp;", " "),
@("&(.{2,6});", ""),
@("&nbsp;", " ")
) | % { $html = $html -replace $_[0], $_[1] }

$PlainText=$html

1 Answer

Do not parse HTML with regex. See this.

You will run into issues at one point or another, because HTML has many special cases that your regexes will miss, and because browsers are very lenient with bad HTML. That means a page can render properly even when its markup is invalid (e.g., unclosed div tags).
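As one concrete illustration of such a case, here is a minimal sketch built on an assumption about the input, since the email HTML is not shown in the question. The sample string is hypothetical. It shows that the question's `style` pattern leaves a multi-line block untouched, because `.` does not match a newline by default in .NET regexes:

```powershell
# Hypothetical sample: a <style> block spanning several lines, as is common in HTML email.
$sample = "<style>`np { color: red; }`n</style><p>Hello</p>"

# Pattern shape from the question: '.' stops at newlines, so nothing is removed.
$unchanged = $sample -replace '<style[^>]*?>.*?</style>', ''

# Same pattern with the (?s) inline option, which lets '.' match newlines too.
$stripped = $sample -replace '(?s)<style[^>]*?>.*?</style>', ''

$unchanged -eq $sample   # True: the style block survived
$stripped                # <p>Hello</p>
```

The `(?s)` option only patches this one case; nested or malformed markup can still defeat the pattern, which is why a real parser is the safer route.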

Assuming Windows, the HTMLFile COM object should work. Replace the source (the first statement) with your actual HTML content and try it out.

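  # Fetch some HTML to test with; replace this with your own email HTML
  $Source = Invoke-RestMethod 'https://stackoverflow.com/questions/72678474/ps-to-strip-html-from-html-formatted-email'
  # Let the HTMLFile COM object (Windows only) parse the markup
  $HTML = New-Object -Com "HTMLFile"
  $HTML.write([ref]$Source)
  # innerText returns the text of the body without any tags
  $TextOnly = $HTML.body.innerText

  Write-Host $TextOnly -ForegroundColor Cyan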

There are also libraries (e.g., HtmlAgilityPack) and modules that can parse HTML for you and handle the parsing issues that might occur.
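In the same spirit of relying on existing code, the hand-written entity table in the question could probably be replaced by the decoder that ships with .NET. This is a sketch that was not part of the original answer, and the sample string is hypothetical:

```powershell
# Hypothetical sample text containing a few of the entities the question handles by hand.
$encoded = 'Fish &amp; Chips &lt;b&gt; &copy; 2022 &#169; &quot;quoted&quot;'

# System.Net.WebUtility is part of .NET; HtmlDecode resolves named and numeric entities.
$decoded = [System.Net.WebUtility]::HtmlDecode($encoded)

$decoded
```

Note that this decodes to the real characters (for example the copyright sign) rather than ASCII stand-ins such as "(c)", so a few explicit `-replace` calls may still be wanted afterward if plain ASCII output is required.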

Sage Pourpre
  • Hi Sage, thank you for your input. I tried your suggestion above and get the following returned from PS 5.1 (on Windows): At line:1 char:121 + ... ct//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> – JustBry Jun 19 '22 at 18:12
  • Perhaps there's a better way to solve this - the HTML formatted emails contain a table with eight rows and two columns (they're all consistent), I need the data from the table. Suspect there's a way to extract that from each email. – JustBry Jun 19 '22 at 18:43
  • @JustBry I can't see your test code, but since the error is on line 1 and the message states that "< is reserved for future use", I suspect you put the HTML under test in a double-quoted string. Don't do that. Put it in a single-quoted string to avoid any auto-expansion, which could have caused part of the string not to be treated as a string (such as the "<" symbol being interpreted as an operator). If you confirm that this was the cause and it works afterward, the parsed HTML from above can be used to extract the table easily. – Sage Pourpre Jun 20 '22 at 03:29
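
To illustrate the quoting advice in the last comment, here is a minimal sketch using a single-quoted here-string. The HTML is a hypothetical stand-in for the email, not the asker's actual content. Inside `@'` ... `'@`, double quotes and `$` are kept literally, so the markup reaches the parser intact:

```powershell
# Hypothetical stand-in for the email HTML; note the double quotes and the $ sign.
$Source = @'
<!DOCTYPE html>
<html><body><table>
<tr><td class="label">Total</td><td>$5.00</td></tr>
</table></body></html>
'@

# Nothing was expanded or cut short: the quotes and the $ sign are still there.
$Source.Contains('class="label"')   # True
$Source.Contains('$5.00')           # True
```

The closing `'@` has to start at the beginning of its line. A string built this way can then be passed to the `write` call from the answer in place of the downloaded page.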