VB.Net - Regex to clean HTML Table

Question

 Regex.Replace(Regex.Replace(Replace(Replace(Replace _
(Replace(Replace(Regex.Replace(HTML, "(<[a|A][^>]*>|)", ""), _
"<TBODY>", ""), "<THEAD>", ""), "</TBODY>", ""), _
"</THEAD>", ""), "</A>", ""), "( .*=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)", ""), _
"</?SPAN( [^>]*|/)?>", "")

I am using the above to clean an HTML table. It is very long and inefficient, and clearing the span elements does not work properly. I am not good with regex, so this is all I could piece together from the internet. I am looking for a regex combination that does the following:

Remove the opening and closing a, tbody and thead tags, but keep their inner HTML and text.
Remove every attribute (name and value) from every element.
Remove span nodes completely, including their inner HTML and text.
Do all three of the above regardless of case.

I need nothing more than the table, th, tr and td tags and their inner text, without anchor links.

EDIT: These are the exact expressions that I need. Is there any way to combine them into a single pattern?

HTML = Regex.Replace(HTML, "(<(a|A)[^>]*>|</(a|A)>)", "")
HTML = Regex.Replace(HTML, "(<(tbody|TBODY)[^>]*>|</(tbody|TBODY)>)", "")
HTML = Regex.Replace(HTML, "(<(thead|THEAD)[^>]*>|</(thead|THEAD)>)", "")
HTML = Regex.Replace(HTML, "( .*=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)", "")
HTML = Regex.Replace(HTML, "(<(span|SPAN))[^>]*?>.*?</((span|SPAN)>)", "")
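
For what it's worth, here is a rough sketch of how the five calls above might be folded into one pattern: a single alternation compiled with RegexOptions.IgnoreCase (so the upper/lower-case duplicates go away) plus a MatchEvaluator that decides what each match becomes. The names TableCleaner, CleanTable, ReplaceMatch and the group names name and slash are made up for illustration. It assumes span elements are not nested and that attribute values never contain a ">" character; on real-world HTML those assumptions can easily fail.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TableCleaner
    ' Alternatives are tried left to right at each "<":
    '   1. a whole <span>...</span> element (removed together with its content)
    '   2. an opening or closing a/tbody/thead tag (tag removed, content kept)
    '   3. any other opening tag (tag kept, attributes dropped)
    Private Const Pattern As String = _
        "<span\b[^>]*>.*?</span>" & _
        "|</?(?:a|tbody|thead)\b[^>]*>" & _
        "|<(?<name>[a-z][a-z0-9]*)\b[^>]*?(?<slash>/?)>"

    ' IgnoreCase replaces the (a|A)-style duplication;
    ' Singleline lets ".*?" cross line breaks inside a span.
    Private ReadOnly Cleaner As New Regex(Pattern, _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    ' Hypothetical helper: rebuilds a kept tag without attributes,
    ' and returns an empty string for everything else that matched.
    Private Function ReplaceMatch(ByVal m As Match) As String
        If m.Groups("name").Success Then
            Return "<" & m.Groups("name").Value & m.Groups("slash").Value & ">"
        End If
        Return ""
    End Function

    Function CleanTable(ByVal html As String) As String
        Return Cleaner.Replace(html, New MatchEvaluator(AddressOf ReplaceMatch))
    End Function

    Sub Main()
        Dim sample As String = _
            "<TABLE border=""1""><TBODY><tr class='r'><td>" & _
            "<A href=""x.html"">Link</A><SPAN>gone</SPAN>" & _
            "</td></tr></TBODY></TABLE>"
        ' prints <TABLE><tr><td>Link</td></tr></TABLE>
        Console.WriteLine(CleanTable(sample))
    End Sub
End Module
```

The order of the alternatives matters: the span and a/tbody/thead branches have to come before the generic tag branch, otherwise those tags would simply be kept with their attributes stripped. Closing tags such as </td> are never matched, so they pass through unchanged.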

Any regex for this task is likely to be "*very long and inefficient.*" as you say. Use an HTML parser instead. — Amal Murali, Jun 17 '14 at 17:04
[Look at one of the most upvoted and visited questions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) on StackOverflow. Then reevaluate your choice. — Steve, Jun 17 '14 at 17:10
Not only will a regular expression be long and inefficient. It will also be a convoluted, bug-ridden, unmaintainable, bug-ridden monstrosity that will get it right in 90% of all possible cases *at maximum*. Yes. I know I mentioned bug-ridden twice. — Tomalak, Jun 17 '14 at 18:31
That is easily the most cringe-worthy, heart-stopping code block I've ever seen on StackOverflow. I wonder how many hours have crept into "developing" that piece of mould! Anyway, don't be surprised if your computer implodes somewhere in the near future. — MarioDS, Jun 18 '14 at 12:07
