
I'm trying to figure out a way to strip all HTML tags from records in a database and then generate XML from the cleaned text.

Any ideas?

The application is built on ASP.NET 2.0 with SQL Server.

Mike Atlas
jrutter

3 Answers


See this question: Using C# regular expressions to remove HTML tags. What exactly do you mean by creating XML?
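The regular-expression approach from that question can be sketched as follows. This is a minimal illustration in Python rather than C#, and `strip_tags` and `TAG_PATTERN` are hypothetical names. It assumes the stored markup is simple: a pattern like this can misfire on a `>` inside an attribute value, on comments, or on script blocks.

```python
import html
import re

# Matches anything that looks like a tag: "<", a run of non-">" characters, then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove tag-like substrings, decode entities, and collapse whitespace."""
    # Replace each tag with a space so words in adjacent elements do not run together.
    without_tags = TAG_PATTERN.sub(" ", text)
    # Turn entities such as &amp; back into plain characters.
    decoded = html.unescape(without_tags)
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(decoded.split())


print(strip_tags("<p>Blue <b>widget</b> &amp; case</p>"))  # prints: Blue widget & case
```

Replacing tags with a space is a design choice: it keeps text from neighbouring block elements apart, at the cost of splitting a word that has inline markup in the middle of it.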

Shoban
  • Well, we need to deliver an XML feed of all our products to a vendor, and they want us to strip out all the HTML. So I'm wondering if there is an easy way to do that? – jrutter Nov 15 '09 at 23:37

Why not just parse the page into a DOM tree, then walk the elements and pull out the values you need, along with any attributes you deem necessary?

If you wrote the HTML files yourself, they should be well-formed, so this would be easy.
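As a rough sketch of the parse-and-walk idea, here is one way to do it in Python with the standard library's event-based `HTMLParser`, which visits elements in document order without building a full DOM tree. The names `TextExtractor`, `extract`, and `BLOCK_TAGS` are made up for this example, and keeping `href` values is only an assumption about which attributes might matter.

```python
from html.parser import HTMLParser

# Tags treated as word boundaries (an assumed, non-exhaustive list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collects visible text and link targets while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []       # text fragments in document order
        self.links = []       # href attribute values from <a> tags
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag in BLOCK_TAGS:
            self.parts.append(" ")
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Ignore script and style bodies; keep everything else.
        if self._skip_depth == 0:
            self.parts.append(data)


def extract(markup):
    """Return (plain_text, links) for a fragment of HTML."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text, parser.links


text, links = extract('<p>See the <a href="/specs">spec sheet</a>.</p><script>var x = 1;</script>')
print(text)   # prints: See the spec sheet.
print(links)  # prints: ['/specs']
```

Unlike a strict XML parser, this one tolerates markup that is not well-formed, which matters if some of the stored records were not hand-written.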

James Black
  • I like this answer. You need to implement DOM objects and a parser; after all, HTML is some sort of XML. You basically need to convert the HTML tags to XML tags, so while you are parsing, you can replace the HTML tags with XML tags. – DarthVader Nov 16 '09 at 01:57

Don't strip the HTML in the database or with SQL. Instead, strip it out at the last mile, in your application code, with a scraper.

Google this: "HTML scraper". HTML screen-scraping tools read HTML content and output the content without the markup. Alternatively, search Stack Overflow for "Screen-scraping HTML".
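To illustrate stripping at the last mile, here is a hedged Python sketch that cleans each field only at the moment the feed is written. The element names (`products`, `product`, `name`, `description`), the row keys, and the helpers `build_feed` and `demo_clean` are all assumptions for the example; the vendor's real schema is not given here. The cleaner is passed in as a function, so any scraper or parser can be swapped in.

```python
import re
import xml.etree.ElementTree as ET


def build_feed(rows, clean):
    """Build an XML feed from database rows, cleaning each text field on the way out."""
    root = ET.Element("products")
    for row in rows:
        product = ET.SubElement(root, "product", id=str(row["id"]))
        # ElementTree escapes &, < and > in text, so cleaned values are safe to assign directly.
        ET.SubElement(product, "name").text = clean(row["name"])
        ET.SubElement(product, "description").text = clean(row["description"])
    return ET.tostring(root, encoding="unicode")


def demo_clean(value):
    """Stand-in cleaner for the demo: drop tag-like substrings and collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", value).split())


# Hypothetical rows, standing in for the result of a database query.
rows = [{"id": 1, "name": "Widget", "description": "<p>Fits <b>3/4 in</b> pipe & hose</p>"}]
feed = build_feed(rows, demo_clean)
print(feed)
```

The stored records stay untouched, so the same data can still be rendered as HTML elsewhere, and letting the XML library handle escaping avoids producing a malformed feed by hand-concatenating strings.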

Mike Atlas
  • Don't tell him to Google this (even if that's what he should have done); point him straight here at Stack Overflow for that ;) http://stackoverflow.com/search?q=html+scraper – Esteban Küber Nov 16 '09 at 02:02