Removing MS Office junk from a text string

Question

I have a string that contains a bunch of MS Word garbage like this:

<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>

</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <m:mathPr>
   <m:mathFont m:val="Cambria Math"/>
   <m:brkBin m:val="before"/>
  </m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]>

<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-hansi-font-family:Calibri;
    mso-hansi-theme-font:minor-latin;}
</style>
<![endif]-->

I've tried the function below to remove it, but it only removes parts of the markup and leaves a ton of white space:

Public Function CleanOfficeJunk(html As String) As String
    ' start by completely removing all unwanted tags 
    html = System.Text.RegularExpressions.Regex.Replace(html, "<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    ' then run another pass over the html (twice), removing unwanted attributes 
    html = System.Text.RegularExpressions.Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    html = System.Text.RegularExpressions.Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    Return html
End Function

I'm using this in a SQL Server Reporting Services (SSRS) report and need to clean the strings before I display them in a textbox.

Is there a better way of removing stuff like this?

Edit: I did see the post "Remove HTML comments with Regex, in Javascript", but the accepted answer didn't seem to work in my situation.

Is it possible that your string contains legitimate `<` or `>` characters? If not, why don't you just remove everything enclosed in `< and >`... — Trevor, Feb 04 '16 at 17:16
Last question: do you want to keep only the text between the tags you specify, that is, only for the tags you allow? — Trevor, Feb 04 '16 at 17:57
@Codexer I just want to remove all the stuff between the xml and style tags. If there's something like, hi there... then I'd want to keep that. Thanks — SkyeBoniwell, Feb 04 '16 at 18:09
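
For what it's worth, the requirement described in that last comment (drop everything between the xml and style tags, keep ordinary text) could be sketched roughly as below. This is only a sketch under some assumptions, not a tested SSRS solution: `OfficeCleaner` and `StripOfficeBlocks` are made-up names, the junk is assumed to arrive as `<!--[if ...]> ... <![endif]-->` conditional comments or plain `<xml>`/`<style>` elements as in the sample, and none of those blocks are assumed to be nested. The difference from the function in the question is that each pattern matches the opening delimiter, the content, and the closing delimiter together (using `RegexOptions.Singleline` so `.` spans line breaks), which is why the inner text and most of the leftover white space go away too.

```vbnet
Public Module OfficeCleaner
    ' Hypothetical helper, not from the original post: removes whole Office
    ' blocks (delimiters AND their content) instead of only the tags.
    Public Function StripOfficeBlocks(html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty

        ' Singleline lets "." match newlines, so a block may span many lines.
        Dim opts As System.Text.RegularExpressions.RegexOptions =
            System.Text.RegularExpressions.RegexOptions.IgnoreCase Or
            System.Text.RegularExpressions.RegexOptions.Singleline

        ' 1. Conditional comments: <!--[if gte mso 9]> ... <![endif]-->
        html = System.Text.RegularExpressions.Regex.Replace(html, "<!--\[if\b.*?<!\[endif\]-->", "", opts)

        ' 2. Any ordinary HTML comments that remain
        html = System.Text.RegularExpressions.Regex.Replace(html, "<!--.*?-->", "", opts)

        ' 3. <xml>...</xml> and <style>...</style> elements, including content
        html = System.Text.RegularExpressions.Regex.Replace(html, "<(xml|style)\b[^>]*>.*?</\1\s*>", "", opts)

        ' 4. Collapse each run of blank or whitespace-only lines to one line break
        html = System.Text.RegularExpressions.Regex.Replace(html, "[ \t]*(\r?\n)\s*", "$1")

        Return html.Trim()
    End Function
End Module
```

The type names are fully qualified, as in the question's function, so the function body on its own could be pasted into the report's Code tab and called from a textbox expression along the lines of `=Code.StripOfficeBlocks(Fields!Body.Value)`, where `Body` is a placeholder field name. Any remaining tags (font, span and so on) could still be handled afterwards by the passes in the original `CleanOfficeJunk`.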

score 0 · Answer 1 · answered Oct 16 '17 at 13:59

You should try setting the placeholder's Markup type property to HTML, so that SSRS interprets the tags instead of displaying them. That fixed my problem.

Altaf Patel