C# Regex to parse HTML string and add ids into each header tag?

Question

I've got a CMS and I need to do some auto-formatting on the HTML strings before they get served to the client. So in the database I may have an HTML string like this:

> "<h2>Example Header</h2><p>Here is some text about that
> header.</p><h2>Another Header 2</h2><p>Well I got more information
> here.</p>"

I want to add an ID attribute to every H2 tag, where the ID value is the text inside the H2 tag with spaces removed. The IDs will be used for anchor links. So the above example would be turned into:

> "<h2 id="ExampleHeader">Example Header</h2><p>Here is some text about that
> header.</p><h2 id="AnotherHeader2">Another Header 2</h2><p>Well I got more
> information here.</p>"

So for every H2 in the string go from:

<h2>Header Example Text Right Here</h2>

To:

<h2 id="HeaderExampleTextRightHere">Header Example Text Right Here</h2>

Spaces removed but otherwise the exact same text. How can I do that with regex?

Don't use [regex to parse html](http://stackoverflow.com/a/1732454/1895201). [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/) is perfect for this. – DGibbs, Mar 27 '14 at 16:29
While I appreciate the sentiment, I've been given specific instructions (which I disagree with) not to use additional libraries in this project due to its legacy nature and some other malarkey. – SventoryMang, Mar 27 '14 at 16:36
Using regexes simply to parse HTML is bad enough; using regexes to parse HTML *and add ID attributes to elements* will be damaging to your mental health. – Chris, Mar 27 '14 at 16:44

score 2 · Answer 1 · answered Mar 27 '14 at 16:35

If there is any HTML processing library available to you in C#, please go with that. Regex can handle your example HTML, but it will not be safe for more complex input.

Here is the regex/replace for your sample input. Remember, only for your sample input:

htmls = Regex.Replace(htmls, @"<h2>([^<]*)</h2>", "<h2 id=\"$1\">$1</h2>");

Sabuj Hassan

Yes I understand. If there is HTML inside the h2 tag there could be problems yes? That won't happen for this case though. Could there be some other issues? – SventoryMang Mar 27 '14 at 16:38
You can't have spaces in HTML ID values, so your example of "Example Header" would end up being an illegal ID if used as-is. Additionally, IDs have to be unique, so if you have two or more h2 tags with identical content you're going to end up with two identical IDs, which is illegal. – Chris Mar 27 '14 at 16:43
@DOTang hopefully no other issues around. – Sabuj Hassan Mar 27 '14 at 16:44
@Chris Excellent point. My knowledge on HTML is poor. Thanks for the added information. – Sabuj Hassan Mar 27 '14 at 16:45
HTML 5 states: "The value must be unique amongst all the IDs in the element's home subtree and must contain at least one character. The value must not contain any space characters." – Chris Mar 27 '14 at 16:47
HTML 4 was more restrictive: "ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")." – Chris Mar 27 '14 at 16:48
@SabujHassan Can you update your answer to reflect the removal of whitespace/spaces? – SventoryMang Mar 27 '14 at 16:49
@DOTang not sure what you mean, but if you want something like trimming, then use `<h2>\s*([^<]*?)\s*</h2>`. Remember to escape the backslashes. – Sabuj Hassan Mar 27 '14 at 16:53
@SabujHassan That didn't seem to change anything. How can I remove any spaces from the $1 parameter? So if it's `<h2>Hi There Friend</h2>`, the new text becomes `<h2 id="HiThereFriend">Hi There Friend</h2>`. The inner text but without spaces. – SventoryMang Mar 27 '14 at 17:00
@DOTang look for any `regex/replace using callback for C#` From the callback you can remove the spaces and pack them in `h2` – Sabuj Hassan Mar 27 '14 at 17:11
@SabujHassan I'm not sure what you mean; I am horrible at regex, which is why I posted this question. I tried putting the second parameter as `"$1"` but that broke it entirely. – SventoryMang Mar 27 '14 at 17:26
@DOTang I am not sure whether you can use `Trim()` over the `$1` or not. That's why I suggested to find something like `callback` in `C#` where you can use your own function. Right now I don't find any. – Sabuj Hassan Mar 27 '14 at 17:34
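
For readers following this thread: the "callback" discussed above does exist in .NET, as the `MatchEvaluator` overload of `Regex.Replace`. It accepts a delegate or lambda that receives each `Match` and returns the replacement text, so ordinary string code can run on the captured group. Below is a minimal sketch, assuming the headers contain plain text with no attributes or nested tags, as in the question's sample. The names `HeaderIds` and `AddIds` are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class HeaderIds
{
    // Matches an <h2> that has no attributes and no nested tags,
    // which is all the question's sample input requires.
    private static readonly Regex H2 = new Regex(@"<h2>([^<]*)</h2>");

    public static string AddIds(string html)
    {
        // The lambda is the callback: it runs once per match and
        // returns the text that replaces that match.
        return H2.Replace(html, m =>
        {
            string text = m.Groups[1].Value;
            // Strip every whitespace character to build the id value.
            string id = Regex.Replace(text, @"\s+", "");
            return "<h2 id=\"" + id + "\">" + text + "</h2>";
        });
    }
}
```

Under those assumptions, `HeaderIds.AddIds("<h2>Hi There Friend</h2>")` returns `<h2 id="HiThereFriend">Hi There Friend</h2>`. A plain `$1` substitution string cannot do this, because substitution patterns only copy captured text and cannot call methods on it.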

score 1 · Accepted Answer · answered Mar 27 '14 at 16:37

You can use this:

Regex.Replace("<h2>XYZ</h2>", "<h2>(?<innerText>[^<]*)</h2>", x => string.Format("<h2 id=\"{0}\">{0}</h2>", x.Groups["innerText"]))

brz

Not quite, there was no removal of whitespace for the ID, but with this I was able to get it working: `Regex.Replace(htmls, @"<h2>(?<innerText>[^<]*)</h2>", x => string.Format("<h2 id=\"{0}\">{1}</h2>", x.Groups["innerText"].Value.Trim().Replace(" ", string.Empty), x.Groups["innerText"]));` – SventoryMang Mar 27 '14 at 18:15

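Chris's point about uniqueness in the comments on the first answer still applies to the working version: two headers with the same text would get the same ID. Below is one possible sketch that appends a numeric suffix to repeated IDs. The suffix format (`-2`, `-3`, ...), the `section` fallback for empty headers, and the names `UniqueHeaderIds` and `AddIds` are all assumptions for illustration, not something from the thread.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UniqueHeaderIds
{
    public static string AddIds(string html)
    {
        // Counts how many times each base id has been produced so far.
        var seen = new Dictionary<string, int>();

        return Regex.Replace(html, @"<h2>(?<innerText>[^<]*)</h2>", m =>
        {
            string text = m.Groups["innerText"].Value;
            // Remove all whitespace, as in the accepted approach.
            string id = Regex.Replace(text, @"\s+", "");
            // Assumed fallback: an id must contain at least one character.
            if (id.Length == 0) id = "section";

            int count;
            seen.TryGetValue(id, out count); // count stays 0 when the id is new
            seen[id] = count + 1;
            // Assumed convention: the second occurrence becomes "Id-2", and so on.
            if (count > 0) id = id + "-" + (count + 1);

            return string.Format("<h2 id=\"{0}\">{1}</h2>", id, text);
        });
    }
}
```

This still inherits the limits of the regex approach. It does not match headers with attributes or nested tags, it does not escape a double quote in the header text, and a header whose literal text is already something like `Intro-2` could still collide with a generated suffix.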