Finding certain text and filtering the rest out

Question

Let's say I have a huge string, and I want to filter out everything except the text I'm looking for. Here's an example:

<strong>You</strong></font> <font size="3" color="#05ABF8">
<strong>Shook</strong></font> Me All <font size="3" color="#05ABF8">
<strong>Night</strong></font> <font size="3" color="#05ABF8">
<strong>Long</strong></font> mp3</a></div>

As you can see, there's text in between all those tags. I want to get "You Shook Me All Night Long" and strip out the rest. How would I go about accomplishing this?

Read this http://stackoverflow.com/questions/4878452/remove-html-tags-in-string and/or http://stackoverflow.com/questions/787932/using-c-sharp-regular-expressions-to-remove-html-tags — Mate, Oct 28 '12 at 02:30
@ Maxim, yes I wanted that too. Anyways, Mate's response got me the help I needed. Here's the code I found helpful: `String result = Regex.Replace(htmldoc, @"<[^>]*>", String.Empty);`. Thanks. — user1667191, Oct 28 '12 at 02:41

maximpa · Accepted Answer · 2012-10-28T11:30:29.410

You can use the following regex: >([\s\w]+)<

// Requires: using System.Linq; using System.Text.RegularExpressions;
var input = @"
<strong>You</strong></font> <font size='3' color='#05ABF8'>
<strong>Shook</strong></font> Me All <font size='3' color='#05ABF8'>
<strong>Night</strong></font> <font size='3' color='#05ABF8'>
<strong>Long</strong></font> mp3</a></div>";

// Capture runs of whitespace and word characters between '>' and '<'
var regex = new Regex(@">(?<match>[\s\w]+)<");

var matches = regex.Matches(input).Cast<Match>()
   // Get only the values from the group 'match'
   // So, we ignore '<' and '>' characters
   .Select(p => p.Groups["match"].Value);

The matches are the runs of text between the tags.

// Concatenate the captures to one string
var result = string.Join(string.Empty, matches)
    // Remove unnecessary carriage return characters if needed
    .Replace("\r\n", string.Empty);

The result is `You Shook Me All Night Long mp3`, assuming the line breaks in the input are `\r\n`.
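For reference, the capture-group approach above can be put together as a small self-contained program. This is only a sketch: the class name `TextBetweenTags` and the method name `Extract` are made up for illustration, and it collapses whitespace with a second regex instead of replacing `"\r\n"`, so it does not depend on the line-ending style of the input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TextBetweenTags
{
    // Runs of whitespace and word characters that sit between '>' and '<'.
    private static readonly Regex Between = new Regex(@">(?<match>[\s\w]+)<");

    public static string Extract(string html)
    {
        var pieces = Between.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value);

        // Join the captures, then collapse every whitespace run
        // (spaces, "\r\n" or "\n") into one space and trim the ends.
        var joined = string.Join(string.Empty, pieces);
        return Regex.Replace(joined, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        var input =
            "<strong>You</strong></font> <font size='3' color='#05ABF8'>\n" +
            "<strong>Shook</strong></font> Me All <font size='3' color='#05ABF8'>\n" +
            "<strong>Night</strong></font> <font size='3' color='#05ABF8'>\n" +
            "<strong>Long</strong></font> mp3</a></div>";

        Console.WriteLine(Extract(input)); // You Shook Me All Night Long mp3
    }
}
```

Note that the trailing `mp3` is plain text in the input, so it survives. Also, because the capture only allows whitespace and word characters, a text run containing punctuation (for example an apostrophe) is skipped entirely; stripping the tags instead avoids that limitation.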

score 1 · Answer 2 · answered Oct 28 '12 at 02:40

This assumes the full string also contains the start tags that match the closing </a></div> at the end of the XML/HTML you posted, so that it parses as well-formed XML:

string value = XElement.Parse(string.Format("<root>{0}</root>", yourstring)).Value;

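Or use an extension method that strips HTML tags:

using System.Text.RegularExpressions;
public static class StringExtensions
{
    // Extension methods must be declared in a static class.
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>");
        return reg.Replace(HTMLText, "").Replace("&nbsp;", " ");
    }
}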
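As a variation on the tag-stripping idea, the sketch below also decodes HTML entities and tidies whitespace. The names `HtmlText` and `ToPlainText` are hypothetical; `WebUtility.HtmlDecode` is the framework call from `System.Net`. Like any regex-based stripper, it assumes tags do not contain a literal `>` inside attribute values.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate the approach.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Remove anything that looks like a tag.
        var withoutTags = Regex.Replace(html, "<[^>]+>", string.Empty);

        // Decode entities such as &amp; and &nbsp; into characters.
        var decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse whitespace runs (this includes the non-breaking space
        // produced by &nbsp;) into a single space and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<b>Rock &amp; Roll</b>&nbsp;<i>mp3</i>"));
        // Rock & Roll mp3
    }
}
```

Decoding after stripping keeps entities such as `&lt;` from being mistaken for tags.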

Answered by Chuck Savage, Oct 28 '12 at 02:40

You should not use LINQ to XML to parse HTML. – Anirudha, Oct 28 '12 at 05:38
