I have a string that contains a bunch of MS Word garbage like this:

<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>

</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <m:mathPr>
   <m:mathFont m:val="Cambria Math"/>
   <m:brkBin m:val="before"/>
  </m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]>

<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-hansi-font-family:Calibri;
    mso-hansi-theme-font:minor-latin;}
</style>
<![endif]-->

I've tried the function below to remove it, but it only removes parts of the markup and leaves a ton of whitespace behind:

Public Function CleanOfficeJunk(html As String) As String
    ' start by completely removing all unwanted tags 
    html = System.Text.RegularExpressions.Regex.Replace(html, "<[/]?(font|span|xml|del|ins|[ovwxp]:\w+)[^>]*?>", "", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    ' then run another pass over the html (twice), removing unwanted attributes 
    html = System.Text.RegularExpressions.Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    html = System.Text.RegularExpressions.Regex.Replace(html, "<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|""[^""]*""|[^\s>]+)([^>]*)>", "<$1$2>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
    Return html
End Function

I'm using this in a SQL Server Reporting Services (SSRS) report and need to clean the strings before I display them in a textbox.

Is there a better way of removing stuff like this?

Edit: I did see the post "Remove HTML comments with Regex, in Javascript", but the accepted answer didn't seem to work in my situation.

SkyeBoniwell
  • Is it possible that your string could contain legitimate `<` or `>` characters? If not, why don't you just remove everything enclosed in `<` and `>`? – Trevor Feb 04 '16 at 17:16
  • It could contain a `<br>` here or there. Thanks – SkyeBoniwell Feb 04 '16 at 17:25
  • Last question: do you want to keep only the text inside the tags you specify, that is, only the tags you allow? – Trevor Feb 04 '16 at 17:57
  • @Codexer I just want to remove all the stuff between xml and style tags. If there's something like, hi there... then I'd want to keep that thanks – SkyeBoniwell Feb 04 '16 at 18:09

1 Answer

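Try setting the placeholder's Markup type property to HTML (in the Placeholder Properties dialog, on the General page, choose "HTML - Interpret HTML tags as styles"). That fixed my problem.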

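If the Word markup has to be stripped from the string itself, as the question asks, one option is to remove whole blocks instead of individual tags, so that the content between the tags goes away too. The following is only a sketch under that assumption, not a tested SSRS solution: `StripOfficeBlocks` and the module name are hypothetical, and the final whitespace collapse is a design choice that also flattens line breaks in legitimate text.

```vb
Imports System.Text.RegularExpressions

Module OfficeMarkupCleaner

    ' Hypothetical helper (not from the original post): removes Word's
    ' conditional-comment blocks, then stray <xml>/<style> elements,
    ' then ordinary comments, and finally collapses leftover whitespace.
    Public Function StripOfficeBlocks(html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty

        ' Singleline lets "." match newlines so a block can span many lines.
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' 1. <!--[if ...]> ... <![endif]--> including everything inside.
        html = Regex.Replace(html, "<!--\[if\b.*?<!\[endif\]-->", "", opts)

        ' 2. Any <xml>...</xml> or <style>...</style> element outside those comments.
        html = Regex.Replace(html, "<(xml|style)\b[^>]*>.*?</\1\s*>", "", opts)

        ' 3. Any remaining ordinary HTML comments.
        html = Regex.Replace(html, "<!--.*?-->", "", opts)

        ' 4. Collapse runs of two or more whitespace characters to one space.
        html = Regex.Replace(html, "\s{2,}", " ")

        Return html.Trim()
    End Function

    Sub Main()
        Dim sample As String =
            "<!--[if gte mso 9]><xml><w:View>Normal</w:View></xml><![endif]-->" &
            vbCrLf & "<b>hi there</b>"

        ' Prints: <b>hi there</b>
        Console.WriteLine(StripOfficeBlocks(sample))
    End Sub

End Module
```

In an SSRS report, only the function body would go into the report's custom code, with the fully qualified `System.Text.RegularExpressions.Regex` names used in the question in place of the `Imports` line, which as far as I know the report Code window does not accept. It could then be called from a textbox expression such as `=Code.StripOfficeBlocks(Fields!Notes.Value)`, where `Notes` is a made-up field name.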

Altaf Patel