I am using a regex to convert HTML to plain text.

Can you help me remove the blank lines with a regex?

My HTML:

<div class="short-description">
<div class="short-description">
<div class="short-description">
<div class="short-description">
<div class="short-description">
<div class="short-description">
<div class="short-description">
<div class="short-description">
<div class="short-description">
<ul style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana; color: #000000; font-size: 13px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: #ffffff;">
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Processor: Intel® Xeon® E5-2403 1.80GHz, 10M Cache, 6.4GT/s QPI, No Turbo, 4C, 80W, Max Mem 1066MHz</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Memory:&nbsp; 8GB (4x2GB) 1333MHz, Single Ranked LV RDIMMs up to 16GB</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Hard Drive: 1TB 7.2K RPM NL SAS 3.5-inch Hot Plug</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Storage Controller: H310 raid controller Support RAID 0, 1, 5, 10</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">File Access Protocols: CIFS, NFS, FTP, SMB3.0, SMB Direct (RDMA)</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Internal Drive Support: 4 x 3.5" hot-plug drive bays</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Power: 1 x 550W Power Supply (redundant)</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">OS: Window Storage 2008 Workgroup R2 Edition</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Form Factor 1U rack mount system</li>
<li style="margin: 0px; padding: 0px; font-family: Tahoma, Verdana !important;">Warranty: 3 Year ProSupport and NBD On-site Service</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
                            </div>

And my regex:

Regex.Replace(Model.MetaDescription, @"<(.|\n)*?>","")

This is the result (image): Result regex.replace

How can I get output like the image below? (image): Result regex.replace

TRI ÂN

3 Answers

1

As has been mentioned here, you can use the free and open-source HtmlAgilityPack. Check the sample, which includes a method that converts from HTML to plain text.

var plainText = ConvertToPlainText(html);

Feed it an HTML string like

<b>hello world!</b><br /><i>it is me! !</i>

And you'll get a plain text result like:

hello world!
it is me!
Ghasem
0

If I understand the question, you want to remove anything between angle brackets <> and also remove newlines. If so, try this regex:

@"<[^>]*>|\n"

However, as Alex Jolig suggests, use HTML Agility Pack.
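
Stripping the tags alone leaves behind the line breaks that sat between them, which is most likely where the blank lines in the question come from. If you do stay with regex, here is a minimal sketch of one way to handle that, assuming you want to keep one line per non-empty line of text. The `HtmlStripper` class and `StripTags` method are names made up for this illustration:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Hypothetical helper: strips tags with a regex, then drops blank lines.
    public static string StripTags(string html)
    {
        // Remove anything that looks like a tag.
        string text = Regex.Replace(html, @"<[^>]*>", "");

        // Decode entities such as &nbsp; and &amp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse each run of line breaks, together with the whitespace
        // around it, into a single newline.
        text = Regex.Replace(text, @"\s*[\r\n]+\s*", "\n");

        // Drop the leading and trailing whitespace left by the outer tags.
        return text.Trim();
    }
}
```

In the question's code this would replace the `Regex.Replace` call with `HtmlStripper.StripTags(Model.MetaDescription)`. It is still a text-level hack, so it can misbehave on markup that is not as regular as the sample.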

Richard Schneider
0

Don't use RegEx with HTML. RegEx is for regular languages and HTML isn't one. You should use HtmlAgilityPack to parse HTML.
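
As a small illustration of how a tag-stripping pattern can go wrong, consider a `>` inside a quoted attribute value, which is legal HTML. This sketch uses a common tag-stripping pattern; the `NaiveTagStripper` name and the sample input are made up for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveTagStripper
{
    // Removes everything from a '<' up to the next '>'.
    public static string Strip(string html) =>
        Regex.Replace(html, @"<[^>]*>", "");

    // The pattern treats the '>' inside the title attribute as the end of
    // the tag, so part of the attribute leaks into the output.
    public static string Demo()
    {
        string html = "<a title=\"1 > 0\">link</a>";
        return Strip(html); // returns " 0\">link" instead of "link"
    }
}
```

The lazy pattern from the question, `<(.|\n)*?>`, stops at the first `>` in the same way, so it should give the same result on this input.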

With HtmlAgilityPack it becomes very easy:

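using System;
using System.Linq; // needed for Select and ToArray

// html holds the markup from the question.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);

// Take the text of every <li> and decode entities such as &nbsp;.
string[] lines =
    document
        .DocumentNode
        .Descendants("li")
        .Select(x => System.Net.WebUtility.HtmlDecode(x.InnerText))
        .ToArray();

// One list item per line, with no blank lines in between.
string text = String.Join(Environment.NewLine, lines);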

With that I get:

Processor: Intel® Xeon® E5-2403 1.80GHz, 10M Cache, 6.4GT/s QPI, No Turbo, 4C, 80W, Max Mem 1066MHz
Memory:  8GB (4x2GB) 1333MHz, Single Ranked LV RDIMMs up to 16GB
Hard Drive: 1TB 7.2K RPM NL SAS 3.5-inch Hot Plug
Storage Controller: H310 raid controller Support RAID 0, 1, 5, 10
File Access Protocols: CIFS, NFS, FTP, SMB3.0, SMB Direct (RDMA)
Internal Drive Support: 4 x 3.5" hot-plug drive bays
Power: 1 x 550W Power Supply (redundant)
OS: Window Storage 2008 Workgroup R2 Edition
Form Factor 1U rack mount system
Warranty: 3 Year ProSupport and NBD On-site Service
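
If the text you need is not all inside `li` elements, a similar approach may work on the whole document: take `document.DocumentNode.InnerText` and drop the blank lines with LINQ. A minimal sketch under that assumption follows. `PlainText.NonBlankLines` is a hypothetical helper, and it only needs the extracted text, so it does not depend on HtmlAgilityPack itself:

```csharp
using System;
using System.Linq;
using System.Net;

public static class PlainText
{
    // Hypothetical helper: decodes entities, trims each line, and drops empty ones.
    public static string NonBlankLines(string innerText)
    {
        var lines = WebUtility.HtmlDecode(innerText)
            .Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0);

        return string.Join(Environment.NewLine, lines);
    }
}
```

You would call it as `string text = PlainText.NonBlankLines(document.DocumentNode.InnerText);`.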
Enigmativity