Regex.Replace(Regex.Replace(Replace(Replace(Replace _
(Replace(Replace(Regex.Replace(HTML, "(<[a|A][^>]*>|)", ""), _
"<TBODY>", ""), "<THEAD>", ""), "</TBODY>", ""), _
"</THEAD>", ""), "</A>", ""), "( .*=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)", ""), _
"</?SPAN( [^>]*|/)?>", "")
I am using the code above to clean an HTML table. It is very long and inefficient, and the span removal does not work properly. I am not good with regex, so this is all I could piece together from the internet. I am looking for a regex combination that does the following:
- Remove the opening and closing a, tbody and thead tags, but keep their inner HTML and text.
- Remove every attribute (name and value) from every element.
- Remove span nodes completely, INCLUDING their inner HTML and text.
- Do all three of the above case-insensitively.
I need nothing more than the table, th, tr and td tags and their inner text, without anchor links.
EDIT: These are the exact expressions that I need. Is there any way to combine them into a single pattern?
HTML = Regex.Replace(HTML, "(<(a|A)[^>]*>|</(a|A)>)", "")
HTML = Regex.Replace(HTML, "(<(tbody|TBODY)[^>]*>|</(tbody|TBODY)>)", "")
HTML = Regex.Replace(HTML, "(<(thead|THEAD)[^>]*>|</(thead|THEAD)>)", "")
HTML = Regex.Replace(HTML, "( .*=['""][^'""]+?['""])([^/>]*)(?=/?>|\s)", "")
HTML = Regex.Replace(HTML, "(<(span|SPAN))[^>]*?>.*?</((span|SPAN)>)", "")
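To make the goal concrete, here is a minimal sketch of what a shorter version might look like. It is not a confirmed solution: it uses three calls rather than one pattern, relies on RegexOptions.IgnoreCase instead of spelling out both cases, and replaces the attribute pattern above with a different approach (rewriting each opening tag as just its name). The names CleanTable and CleanTableSketch are hypothetical, and the sketch assumes spans are not nested and that attribute values never contain ">".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CleanTableSketch
    ' Hypothetical helper, not part of the original code.
    Function CleanTable(ByVal html As String) As String
        ' IgnoreCase covers <span>/<SPAN>; Singleline lets "." cross line breaks.
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' 1. Remove whole span elements, content included (assumes no nested spans).
        html = Regex.Replace(html, "<span\b[^>]*>.*?</span\s*>", "", opts)

        ' 2. Remove a, tbody and thead opening/closing tags, keeping their content.
        '    \b stops "a" from also matching tags such as <abbr>.
        html = Regex.Replace(html, "</?(?:a|tbody|thead)\b[^>]*>", "", opts)

        ' 3. Strip attributes: rewrite every remaining opening tag as its bare name,
        '    preserving a trailing "/" on self-closing tags.
        html = Regex.Replace(html, "<([a-z][a-z0-9]*)\b[^>]*?(/?)>", "<$1$2>", opts)

        Return html
    End Function

    Sub Main()
        Dim sample As String = "<TABLE class=""x""><TBODY><tr id='r'><td><A href=""#"">Link</A><span>gone</span></td></tr></TBODY></TABLE>"
        ' Prints: <TABLE><tr><td>Link</td></tr></TABLE>
        Console.WriteLine(CleanTable(sample))
    End Sub
End Module
```

The order matters in this sketch: spans are removed first so that their contents never reach the later steps, and attributes are stripped last so that steps 1 and 2 do not depend on how the tags were written.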