Regex expression C# for HTML

Question

I have the following regex:

^(<span style=.*?font-weight:bold.*?>.*?</span>)

It matches the following code:

<span style="font-family:Arial; font-size:10pt"> r.</span></p><p style="margin:0pt"><span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

But I would like to match only this part (the last span, the one containing the font-weight:bold style):

<span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

[You can't parse XHTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML...](http://stackoverflow.com/a/1732454/1185053) — dav_i, Jul 30 '13 at 13:54
Do not try to parse HTML with regular expressions. Go get the [Html Agility Pack](http://htmlagilitypack.codeplex.com/). — Jim Mischel, Jul 30 '13 at 13:55
Guys! Kamil didn't ask whether parsing HTML using Regex is a good idea. He asked a nice and specific question about how to have his regex match a different part of the provided string. The fact that his string happens to look like HTML is completely irrelevant for this question. No need for the HTML-Regex-kneejerk-reflex... — Mels, Jul 30 '13 at 13:57
@Mels - No, Kamil is about to shoot himself in the foot and various other body parts. We cannot, through inaction, allow a human being to come to harm. — Corak, Jul 30 '13 at 14:07
@Mels The fact that his string happens to look like HTML is completely *relevant* as it shines light on the classic XY problem happening here. The OP is asking how to make his "solution" work, when he's clearly using the wrong tools for the job. When he comes back an hour later with another question about matching something else, it'll only add to the pollution on SO. — Dan Lugg, Jul 30 '13 at 14:11
Hehe, lol, I agree. A small comment about it perhaps not being the best of ideas and a pointer in the right direction _would_ be cool and helpful. But 4 (!) separate comments spaced minutes apart saying pretty much the same thing is just a bit much... There _are_ valid cases (although extremely rare) where using a regex will be a better fit than an HTML parser. — Mels, Jul 30 '13 at 14:11

score 7 · Accepted Answer · edited Nov 24 '17 at 16:51

Use the HTML Agility Pack to parse the HTML:

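// Requires the HtmlAgilityPack NuGet package (using HtmlAgilityPack; using System.Linq;).
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

// SelectNodes returns null when nothing matches; GetAttributeValue avoids a
// NullReferenceException on spans that have no style attribute.
var spans = doc.DocumentNode.SelectNodes("//span") ?? Enumerable.Empty<HtmlNode>();
var boldSpans = from s in spans
                let style = s.GetAttributeValue("style", "")
                where style.Contains("font-weight:bold")
                select s;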

Or, even better, use an XPath expression that selects all matching nodes in one line:

doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight:bold')]")

edited Nov 24 '17 at 16:51

carla


answered Jul 30 '13 at 13:59

Sergey Berezovskiy


I actually prefer the first - it's easier to read in my opinion. – dav_i Jul 30 '13 at 14:07
@dav_i that's why I left both options :) – Sergey Berezovskiy Jul 30 '13 at 14:08
Thanks!! I have HTML generated by an external library, so I assumed that the structure of the HTML (the way it is created) would be constant. Anyway, HTML Agility Pack is the better option :) – Kamil Jul 30 '13 at 14:17

Robert Fricke · Answer 2 · 2013-07-30T14:09:41.443

Don't use ^ since the line doesn't start with the span you want to match.

<span style=["'][^'"]*font-weight:bold[^'"]*['"]>[^<]*</span>

Or as an escaped C# string:

"<span style=[\"'][^'\"]*font-weight:bold[^'\"]*['\"]>[^<]*</span>"

This matches a string that starts with <span style= followed by a single or double quote (' or "). Then [^'"]* matches any run of characters other than quotes.

Next comes the literal font-weight:bold, followed again by any number of characters other than quotes, leading up to the closing quote and the end of the opening tag: [^'"]*['"]>.

(Note that you might or might not want to allow more attributes before and after the style attribute. In that case you need to alter the regex.)

The span may contain any number of characters except the tag-opening <, and the match has to end with the closing </span> tag.
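
As a rough illustration, the pattern above could be used from C# along these lines. This is only a sketch: the class name BoldSpanDemo and the helper FindBoldSpans are made up for the example, and the input is the sample from the question. A verbatim string (@"...") is used so that only the double quotes need doubling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class BoldSpanDemo
{
    // Same pattern as above, as a verbatim string ("" is an escaped double quote).
    private const string Pattern =
        @"<span style=[""'][^'""]*font-weight:bold[^'""]*['""]>[^<]*</span>";

    // Returns the text of every span whose style attribute contains font-weight:bold.
    public static List<string> FindBoldSpans(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Sample input from the question.
        string html =
            "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
            "<p style=\"margin:0pt\">" +
            "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

        foreach (string span in FindBoldSpans(html))
        {
            Console.WriteLine(span);
        }
    }
}
```

On the sample from the question this prints only the last span, the one with font-weight:bold. The first span is skipped because its style value contains no font-weight:bold before the closing quote.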

score 0 · Answer 3 · answered Jul 30 '13 at 13:54

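Remove the ^, because it anchors the match to the beginning of the input (or of a line, with RegexOptions.Multiline). With the anchor in place, the match always has to start at the first span. This is made worse by .*?, which can match any characters at all, including the end of one tag and the start of the next.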

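Removing the anchor alone is not enough for the sample in the question, though: the match still starts at the first span and .*? runs on until it reaches font-weight:bold, so the first match is still the output you have now, and the bold span is consumed by it rather than returned as a second match. Restricting what can appear inside the opening tag, for example with [^>]* in place of .*?, gives the span you're after.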

Furthermore, tools such as RegexBuddy are good for testing regexes.
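
One way to check how the pattern behaves once the ^ is removed is to run it against the sample from the question. The sketch below does that. The class and member names are made up for the example, and the second pattern, which uses [^>]* inside the opening tag, is an assumed variant for comparison rather than something taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class AnchorDemo
{
    // Sample input from the question.
    public static readonly string Sample =
        "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
        "<p style=\"margin:0pt\">" +
        "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

    // The question's pattern with only the ^ anchor (and the group) removed.
    public static readonly Regex Unanchored =
        new Regex("<span style=.*?font-weight:bold.*?>.*?</span>");

    // Assumed variant: [^>]* cannot run past the end of the opening tag.
    public static readonly Regex Restricted =
        new Regex("<span style=[^>]*font-weight:bold[^>]*>.*?</span>");

    public static void Main()
    {
        // The lazy .*? still runs from the first span to font-weight:bold,
        // so there is a single match that covers the whole sample.
        Console.WriteLine(Unanchored.Matches(Sample).Count);        // 1
        Console.WriteLine(Unanchored.Match(Sample).Value == Sample); // True

        // The restricted pattern fails at the first span and matches the bold one.
        Console.WriteLine(Restricted.Match(Sample).Value);
    }
}
```

The last line prints only the span with font-weight:bold, because [^>]* stops at the > that closes the first opening tag, so the attempt starting at the first span fails and the engine moves on to the second.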
