3

I've made a small program in C#/.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It loads an RSS feed from the BBC website on startup and then looks for keywords which either increase or decrease the percentage chance of DOOM.

It's a crazy little project, but maybe one day the classes will come in handy for something more important.

I receive the RSS in XML format, but it contains a lot of div tags and formatting characters which I don't want in the database of keywords.

What is the best way of removing these unwanted characters and divs?

Thanks,

Ash


4 Answers

4

IMHO the easiest way is to use regular expressions. Something like:

string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);

Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.

The SO question "Render or convert Html to ‘formatted’ Text (.NET)" might help you, too.
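As a rough sketch of how the regex approach could be extended to also deal with leftover whitespace and HTML entities: the class and method names below (`HtmlCleaner`, `StripTags`, `ToKeywords`) are made up for illustration, and the code assumes a .NET version where `System.Net.WebUtility.HtmlDecode` is available. It is not a full HTML parser, just one way to clean short feed text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Hypothetical helper: strips tags, decodes entities, collapses whitespace.
    public static string StripTags(string html)
    {
        // Replace each tag with a space so adjacent words are not glued together.
        string noTags = Regex.Replace(html, @"<[^>]*>", " ");
        // Turn entities such as &amp; into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse runs of spaces, tabs and newlines into a single space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Hypothetical helper: splits the cleaned text into words and drops
    // empty entries, so no blank strings end up in the keyword list.
    public static string[] ToKeywords(string html)
    {
        return StripTags(html).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, `HtmlCleaner.ToKeywords(description)` could be called on each item's description text before the words are stored. Replacing tags with a space rather than `string.Empty` is a deliberate choice here: it avoids joining words that were only separated by a tag, and the whitespace collapse afterwards removes the surplus.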

f3lix
  • This answer worked, but it basically removes all the characters and puts a blank space in their place, which, once split into an array, gives a lot of white space in the database. How do I solve that? Also, is there any way of adding a parameter to this to remove characters like \n and \t? – Ash Mar 30 '09 at 10:35
  • Not sure why you're seeing *extra* blank space: string.Empty replaces the tags with "", not " ". It's possible that you're not stripping out the excess whitespace (tabs "\t", newlines "\n", etc.) in the RSS. You might want to do a further replace for those, or add them to the regex. – Zhaph - Ben Duguid Mar 30 '09 at 10:42
4

If you want to remove the DIV tags WITH content as well:

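// A lazy match (.*?) stops at the first closing tag; Singleline lets "." span newlines.
// (A negated character class built from the end tag would reject any content containing 'd', 'i' or 'v'.)
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + ".*?" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);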

Input: <xml><div>junk</div>XXX<div>junk2</div></xml>

Output: <xml>XXX</xml>
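The snippet above only matches a bare `<div>` in lowercase. A possible variation, assuming the feed may contain divs with attributes or in uppercase, is sketched below; `DivRemover` and `RemoveDivs` are illustrative names, not part of any library.

```csharp
using System.Text.RegularExpressions;

public static class DivRemover
{
    // <div\b[^>]*> : an opening div tag, with or without attributes
    // .*?          : as little content as possible (lazy)
    // </div\s*>    : the nearest closing div tag
    private static readonly Regex DivBlock = new Regex(
        @"<div\b[^>]*>.*?</div\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes every div element together with its content.
    public static string RemoveDivs(string html)
    {
        return DivBlock.Replace(html, string.Empty);
    }
}
```

Note that this does not handle nested divs: the lazy match ends at the first closing tag it finds, so an outer `</div>` would be left behind. For markup like that, an HTML parser is likely the safer tool.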

Wolf5
  • Ohhh okay, I see, so you're defining the start and end tags and erasing all of it, basically! That's awesome, exactly what I needed, thanks! – Ash Mar 30 '09 at 10:56
2

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.

The commonly accepted method, however, is a regular-expression-based search and replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would need separate regex-based replacements for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help.)

Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
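A minimal sketch of the plain string search and replace, assuming the tags appear exactly as `<div>` and `</div>` (lowercase, no attributes); `DivTagStripper` and `RemovePlainDivTags` are hypothetical names used only for this example.

```csharp
using System;

public static class DivTagStripper
{
    // Removes only the literal "<div>" and "</div>" markers;
    // the content between them is kept.
    public static string RemovePlainDivTags(string html)
    {
        return html.Replace("<div>", string.Empty)
                   .Replace("</div>", string.Empty);
    }
}
```

`string.Replace` is case-sensitive and matches literally, so a tag such as `<div class="x">` or `<DIV>` would be left untouched. If those occur in the feed, a regex is probably the better fit.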

AmitK
1

A regular expression such as this:

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> 

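would match all paired HTML tags, along with the content between them (it needs case-insensitive matching, since the pattern only lists uppercase letters).

Use this to remove them from your data.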
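To illustrate what this pattern does in practice, here is a small sketch. `PairedTagDemo`, `RemoveElements` and `UnwrapElements` are names invented for this example, and the `IgnoreCase` and `Singleline` options are assumptions added so the pattern matches lowercase tags and content that spans several lines.

```csharp
using System.Text.RegularExpressions;

public static class PairedTagDemo
{
    // Group 1 captures the tag name, group 2 the content; \1 requires
    // the closing tag to carry the same name as the opening tag.
    private static readonly Regex PairedTag = new Regex(
        @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes each matched element together with its content.
    public static string RemoveElements(string html)
    {
        return PairedTag.Replace(html, string.Empty);
    }

    // Keeps the content (group 2) and drops only the surrounding tags.
    // Repeats until nothing changes so that nested tags are unwrapped too.
    public static string UnwrapElements(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = PairedTag.Replace(html, "$2");
        } while (html != previous);
        return html;
    }
}
```

Replacing with `string.Empty` discards the text inside the tags, whereas replacing with `$2` keeps it, which may be closer to what a keyword database needs.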

Jon Winstanley
  • Is there a certain order you must put the characters in when writing a regular expression? Is the answer towards the top a lighter expression, or does it not remove all characters? – Ash Mar 30 '09 at 10:46
  • To be honest, the regex I mentioned here will remove all content within tags as well. This may not be what you want. – Jon Winstanley Mar 30 '09 at 13:38