C# string split regex

Question

I have a string containing HTML code, like:

<head><p> this is the header</p></head> <body>..... </body>

I want to split this string such that I only get <head><p> and the tags. Is there a way to do this in C# using regex?

Parsing xml/html should be done with the appropriate tools, not regexes. Regexes cannot parse xml/html... — Willem Van Onsem, Jan 04 '16 at 23:19
Probably a job for an HTML parser. Could you please be more precise about the result you desire? Give an example input and output. — timgeb, Jan 04 '16 at 23:19
[You can't parse HTML with regex](http://stackoverflow.com/a/1732454) — Ian, Jan 04 '16 at 23:31
While it is not **recommended** to parse HTML/XML with regex, the question was whether or not it was "possible", not what the different religions of coding believe on the matter. YES, it is possible, as noted in my answer, although it is strongly discouraged by the community at large. Unless you have a seriously good reason for needing to use regex, you can use MUCH better and MUCH safer techniques like **XElement**, which is a native .NET class. — spencerwjensen, Jan 04 '16 at 23:37

spencerwjensen · Answer 1 · 2016-01-04T23:42:48.400

I assume when you say "only get <head><p> and the tags" that you mean you want to identify ALL tag elements in the entire string, including the closing tags and the <body> tags, etc...?

In any case, the answer is YES, you can do this in C#. There are many good XML/HTML parsers that you can look into, but if you are specifically trying to get an array or list of all of the tag elements and are determined to use regex, then you could use something like Regex.Split(input, pattern). https://msdn.microsoft.com/en-us/library/8yttk7sy(v=vs.110).aspx

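Basically you'll want to set up your pattern. Write it as a verbatim string (@"...") so the C# compiler leaves any backslashes alone; a regular string literal containing \< will not compile. The < and > characters themselves don't need escaping in a .NET regex:

string pattern = @"<[^>]+>"; (note this may not be the exact regex pattern you want)

Then just do something like this:

string[] parts = Regex.Split(htmlString, pattern);

Keep in mind that Regex.Split returns the text between the matches, not the matches themselves. If you want the tags, use Regex.Matches(htmlString, pattern) instead, or wrap the pattern in a capturing group so that Split includes the matched tags in its result.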
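For illustration, here is a minimal sketch of the regex approach applied to the sample string from the question. The pattern `<[^>]+>`, the class name `TagSplitDemo`, and the helper names `GetTags` and `GetTextBetweenTags` are assumptions made for this example, not something the question specifies. It shows the difference between Regex.Matches (which returns the tags) and Regex.Split (which returns the text between them).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagSplitDemo
{
    // Assumed pattern: "<", then one or more characters that are not ">", then ">".
    // It will misfire on things like ">" inside attribute values or comments.
    public const string TagPattern = @"<[^>]+>";

    // Returns the tags themselves, in the order they appear.
    public static string[] GetTags(string html)
    {
        return Regex.Matches(html, TagPattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Returns the text between the tags, dropping empty or whitespace-only pieces.
    public static string[] GetTextBetweenTags(string html)
    {
        return Regex.Split(html, TagPattern)
                    .Where(s => !string.IsNullOrWhiteSpace(s))
                    .ToArray();
    }

    public static void Main()
    {
        string html = "<head><p> this is the header</p></head> <body>..... </body>";

        // prints "<head> <p> </p> </head> <body> </body>"
        Console.WriteLine(string.Join(" ", GetTags(html)));

        // prints the two text pieces separated by "|"
        Console.WriteLine(string.Join("|", GetTextBetweenTags(html)));
    }
}
```

The character class `[^>]+` is used here rather than `.+` because `.+` is greedy and would match everything from the first `<` to the last `>` in one go.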

--UPDATE--

Because of how strongly many people feel about using regex to parse XML/HTML, I thought I should update with an additional comment about an alternative.

If you truly want to get a list or an array of the tag elements in the string, you could use something like the XElement class. Simply create a new XElement object from the string (for example with XElement.Parse), and then you can do all kinds of neat stuff, including iterating over the tags, and nested tags if needed, to create a list or array. XElement is an XML parser, so it only accepts well-formed markup and may not be exactly what you want for an HTML string, but it gives you an idea of the possibilities without needing regex. Cheers!
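As a rough sketch of the XElement idea, assuming the input happens to be well-formed XML (real-world HTML often is not, in which case XElement.Parse throws an XmlException): the class name `XElementTagDemo`, the helper `GetElementNames`, and the wrapping `<root>` element are all hypothetical and only there for the example. The wrapper is needed because the sample string has two top-level elements, and XML allows only one.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XElementTagDemo
{
    // Returns the name of every element in the fragment, in document order.
    // Assumes the fragment is well-formed XML.
    public static string[] GetElementNames(string fragment)
    {
        // Wrap the fragment in a hypothetical <root> element so that
        // several top-level elements can be parsed as one document.
        XElement root = XElement.Parse("<root>" + fragment + "</root>");

        // Descendants() walks all nested elements and excludes <root> itself.
        return root.Descendants()
                   .Select(e => e.Name.LocalName)
                   .ToArray();
    }

    public static void Main()
    {
        string html = "<head><p> this is the header</p></head> <body>..... </body>";

        // prints "head, p, body"
        Console.WriteLine(string.Join(", ", GetElementNames(html)));
    }
}
```

For HTML that is not well-formed XML, a dedicated HTML parser would be the safer choice.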

If you are going to give my answer a negative vote, I'd appreciate a comment explaining why. Is there something wrong with the answer that I can fix? If it's purely because you detest using regex to parse XML/HTML, then that's not really a valid reason to down vote. I am answering THE question at hand, not trying to solve world hunger. — spencerwjensen, Jan 05 '16 at 16:24
