
I am using the C# script below to remove HTML tags from a description column when running in SSIS. I have tried adding the character reference &#58 to the htmlTagPattern string below, but I cannot get it to work.

Any assistance is appreciated.

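using System;
using System.Text.RegularExpressions;

public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Strip HTML tags from the Message column of each incoming row.
        Row.Message = RemoveHtml(Row.Message);
    }

    public String RemoveHtml(String message)
    {
        // Lazily matches '<', any characters (including newlines), then '>'.
        String htmlTagPattern = "<(.|\n)+?>";
        Regex objRegExp = new Regex(htmlTagPattern);
        message = objRegExp.Replace(message, String.Empty);
        return message;
    }
}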
Hadi
David F
  • How about `System.Web.HttpUtility.HtmlDecode([your html string])`? – KeithL Dec 26 '17 at 18:48
  • @DavidF can you add a sample of the data and the expected output? If it is about decoding HTML, you can benefit from `HtmlAgilityPack`, or from the `System.Net` library if you are using .NET Framework 4 or higher, as KeithL suggested. – Hadi Dec 26 '17 at 18:48
  • Never use `(.|\n)+?`, it is a performance killer. In your case, use `<[^>]+>` – Wiktor Stribiżew Dec 26 '17 at 20:41
  • As suggested, here is a data sample and the characters we are removing: div class="ExternalClass4129293D586D41AC85272E1A543E69AE">This is a SharePoint test... The current process to link more than two recipient records is time consuming and requires excessive manual intervention. Make the necessary changes to the linking process to allow two of the multiple records to link, even if more than two records meet the matching criteria. : \n – David F Dec 27 '17 at 13:57
  • @DavidF have you tried my suggestions? – Hadi Dec 27 '17 at 18:11
  • Yes. Thanks for the advice. – David F Dec 28 '17 at 20:47

1 Answer


There are many methods to convert HTML to plain text:

Using the HtmlAgilityPack Library

You can get the code from the samples provided:

You can download HtmlAgilityPack from the following links:

Using System.Net

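If you are using .NET Framework 4 or higher, you can benefit from the System.Net namespace, whose WebUtility class contains a method that decodes HTML character references (such as &#58;) into plain text. Note that it decodes entities only; it does not remove tags:

System.Net.WebUtility.HtmlDecode(Row.Column)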
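
As a rough illustration of what the decode step does, here is a minimal sketch. The class name `HtmlDecodeDemo` and the sample strings are made up for this example; only `System.Net.WebUtility.HtmlDecode` comes from the framework. It shows that character references are converted while tags pass through unchanged:

```csharp
using System;
using System.Net;

// Hypothetical demo class; the name is not part of the original script.
public static class HtmlDecodeDemo
{
    // Converts HTML character references (e.g. "&#58;" or "&amp;") to the
    // characters they stand for. Tags such as <b> are NOT removed.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        Console.WriteLine(Decode("Note&#58; a &amp; b")); // prints "Note: a & b"
        Console.WriteLine(Decode("<b>bold</b>"));         // prints "<b>bold</b>"
    }
}
```

In the Script Component this would presumably be called as `Row.Message = WebUtility.HtmlDecode(Row.Message);`, assuming the column is not null.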

Reference:

Using Regular expressions

You can follow one of these links for more details:
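
A possible way to combine the regex approach with entity decoding is sketched below. It uses the `<[^>]+>` pattern suggested in the comments instead of `<(.|\n)+?>`, then decodes what is left. The class name `HtmlCleaner` and the sample input are assumptions for illustration, and a regex like this is only a rough tag stripper, not an HTML parser:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not part of the original script.
public static class HtmlCleaner
{
    // Matches one tag: '<', one or more characters that are not '>', then '>'.
    // The negated character class also matches newlines inside a tag.
    private static readonly Regex TagPattern = new Regex("<[^>]+>");

    public static string RemoveHtml(string message)
    {
        if (string.IsNullOrEmpty(message))
        {
            return message;
        }

        // Strip tags first, then decode, so that a decoded "&lt;" is not
        // mistaken for the start of a tag.
        string withoutTags = TagPattern.Replace(message, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        string sample = "<div class=\"x\">Summary&#58; <b>test</b></div>";
        Console.WriteLine(RemoveHtml(sample)); // prints "Summary: test"
    }
}
```

The `RemoveHtml` method could replace the one in `ScriptMain` as is; whether the result matches the expected output depends on the actual column data, which is only partly shown in the comments.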

Hadi