Let's say I have a huge string, and I want to filter out everything except the text I'm looking for. Here's an example of the input:

<strong>You</strong></font> <font size="3" color="#05ABF8">
<strong>Shook</strong></font> Me All <font size="3" color="#05ABF8">
<strong>Night</strong></font> <font size="3" color="#05ABF8">
<strong>Long</strong></font> mp3</a></div>

As you can see, there's text in between all the tags. I want to get "You Shook Me All Night Long" and strip out the rest. How would I go about accomplishing this?

maximpa
user1667191
  • Read this: http://stackoverflow.com/questions/4878452/remove-html-tags-in-string and/or http://stackoverflow.com/questions/787932/using-c-sharp-regular-expressions-to-remove-html-tags – Mate Oct 28 '12 at 02:30
  • and/or http://stackoverflow.com/a/1732454/414076 – Anthony Pegram Oct 28 '12 at 02:35
  • What about "mp3", do you need to capture it too? – maximpa Oct 28 '12 at 02:37
  • @Maxim, yes, I wanted that too. Anyway, Mate's response got me the help I needed. Here's the code I found helpful: `String result = Regex.Replace(htmldoc, @"<[^>]*>", String.Empty);`. Thanks. – user1667191 Oct 28 '12 at 02:41

2 Answers

You can use the following regex, which captures each run of whitespace and word characters between a '>' and the next '<': >([\s\w]+)<

using System.Linq;
using System.Text.RegularExpressions;

var input = @"
<strong>You</strong></font> <font size='3' color='#05ABF8'>
<strong>Shook</strong></font> Me All <font size='3' color='#05ABF8'>
<strong>Night</strong></font> <font size='3' color='#05ABF8'>
<strong>Long</strong></font> mp3</a></div>";

// Capture each run of whitespace/word characters between '>' and '<'
var regex = new Regex(@">(?<match>[\s\w]+)<");

var matches = regex.Matches(input).Cast<Match>()
   // Get only the values from the group 'match'
   // So, we ignore '<' and '>' characters
   .Select(p => p.Groups["match"].Value);

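Matches: the text runs between tags (`You`, `Shook`, ` Me All `, `Night`, `Long`, ` mp3`), plus the whitespace-only runs between adjacent tags.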

// Concatenate the captures into one string
var result = string.Join(string.Empty, matches)
    // Remove line breaks (\r\n) if needed
    .Replace("\r\n", string.Empty);

The result (with \r\n line endings in the input): You Shook Me All Night Long mp3
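The `Replace("\r\n", ...)` call only helps when the input uses Windows line endings. Here is a minimal sketch of a variant that collapses any run of whitespace instead, so the outcome doesn't depend on the line endings. The class and method names (`TextBetweenTags`, `Extract`) are made up for illustration, and the `<font>` attributes are shortened:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TextBetweenTags
{
    // Hypothetical helper: collects the text runs found between '>' and '<'
    // and joins them into a single line.
    public static string Extract(string html)
    {
        var pieces = Regex.Matches(html, @">(?<match>[\s\w]+)<")
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value);

        // Collapse every whitespace run (spaces, \r, \n, tabs) to one space.
        return Regex.Replace(string.Concat(pieces), @"\s+", " ").Trim();
    }

    public static void Main()
    {
        var input = "<strong>You</strong></font> <font size='3'>\n"
                  + "<strong>Shook</strong></font> Me All <font size='3'>\n"
                  + "<strong>Night</strong></font> <font size='3'>\n"
                  + "<strong>Long</strong></font> mp3</a></div>";

        Console.WriteLine(Extract(input)); // You Shook Me All Night Long mp3
    }
}
```

Note that `[\s\w]+` skips any text run that contains punctuation, so this fits the sample but not arbitrary HTML.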

maximpa

Assuming the full string contains a matching start tag for every closing tag (such as the trailing </a></div> in the XML/HTML you posted), you can parse it as XML and read its text value:

string value = XElement.Parse(string.Format("<root>{0}</root>", yourstring)).Value;
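As a rough illustration of the `XElement` approach, here is a sketch that runs on a well-formed version of the snippet. The opening `<div>`, `<a>` and first `<font>` tags in `fragment` are assumed, since the question only shows the tail of the markup, and the helper name `InnerText` is hypothetical. It also passes `LoadOptions.PreserveWhitespace`, because whitespace-only text between elements is not kept by default:

```csharp
using System;
using System.Xml.Linq;

public static class XElementDemo
{
    // Hypothetical helper: wraps the fragment in a single root element
    // and returns the concatenated text of all descendant text nodes.
    public static string InnerText(string fragment)
    {
        // PreserveWhitespace keeps whitespace-only text between tags,
        // such as the space between </font> and the next <font>.
        return XElement.Parse("<root>" + fragment + "</root>",
                              LoadOptions.PreserveWhitespace).Value;
    }

    public static void Main()
    {
        // Assumed well-formed input: the opening tags are invented here
        // so that every closing tag has a match.
        var fragment =
            "<div><a><font size=\"3\" color=\"#05ABF8\">" +
            "<strong>You</strong></font> <font size=\"3\" color=\"#05ABF8\">" +
            "<strong>Shook</strong></font> Me All</a></div>";

        Console.WriteLine(InnerText(fragment)); // You Shook Me All
    }
}
```

`XElement.Parse` throws an `XmlException` on markup that is not well-formed, so this only suits input you can trust to be valid XML.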

Or use an extension method that strips HTML tags with a regex, which does not require well-formed input:

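// Extension methods must live in a static class; the class name here is arbitrary
public static class StringExtensions
{
    public static string StripHTML(this string HTMLText)
    {
        // Remove anything that looks like a tag, then replace non-breaking space entities
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "").Replace("&nbsp;", " ");
    }
}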
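To show how the tag-stripping regex behaves on the snippet from the question, including its unmatched closing tags, here is a small self-contained sketch. `StripDemo` and `StripTags` are illustrative names, and the whitespace collapsing at the end is an addition that is not part of `StripHTML`:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Same idea as StripHTML, followed by whitespace collapsing
    // (the collapsing step is an assumption about the desired output).
    public static string StripTags(string html)
    {
        var noTags = Regex.Replace(html, "<[^>]+>", "").Replace("&nbsp;", " ");
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        var input =
            "<strong>You</strong></font> <font size=\"3\" color=\"#05ABF8\">\r\n" +
            "<strong>Shook</strong></font> Me All <font size=\"3\" color=\"#05ABF8\">\r\n" +
            "<strong>Night</strong></font> <font size=\"3\" color=\"#05ABF8\">\r\n" +
            "<strong>Long</strong></font> mp3</a></div>";

        Console.WriteLine(StripTags(input)); // You Shook Me All Night Long mp3
    }
}
```

Unlike the capture-based regex, this keeps text that contains punctuation, because it removes the tags rather than selecting the text between them.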
Chuck Savage