1

I have the following regex:

^(<span style=.*?font-weight:bold.*?>.*?</span>)

It matches the following code:

<span style="font-family:Arial; font-size:10pt"> r.</span></p><p style="margin:0pt"><span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>

But I would like to match only this part (the last span, whose style contains font-weight:bold):

<span style="font-family:Arial; font-size:10pt; font-weight:bold">&#xa0;</span>
HamZa
Kamil
  • I think you should look for an HTML parser. – HamZa Jul 30 '13 at 13:49
  • 1
    [You can't parse XHTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML...](http://stackoverflow.com/a/1732454/1185053) – dav_i Jul 30 '13 at 13:54
  • 1
    Do not try to parse HTML with regular expressions. Go get the [Html Agility Pack](http://htmlagilitypack.codeplex.com/). – Jim Mischel Jul 30 '13 at 13:55
  • 2
    Guys! Kamil didn't ask whether parsing HTML using Regex is a good idea. He asked a nice and specific question about how to have his regex match a different part of the provided string. The fact that his string happens to look like HTML is completely irrelevant for this question. No need for the HTML-Regex-kneejerk-reflex... – Mels Jul 30 '13 at 13:57
  • 3
    @Mels - No, Kamil is about to shoot himself in the foot and various other body parts. We cannot, through inaction, allow a human being to come to harm. – Corak Jul 30 '13 at 14:07
  • 1
    @Mels The fact that his string happens to look like HTML is completely *relevant* as it shines light on the classic XY problem happening here. The OP is asking how to make his "solution" work, when he's clearly using the wrong tools for the job. When he comes back an hour later with another question about matching something else, it'll only add to the pollution on SO. – Dan Lugg Jul 30 '13 at 14:11
  • 1
    Hehe, lol, I agree. A small comment about it perhaps not being the best of ideas and a pointer in the right direction _would_ be cool and helpful. But 4 (!) separate comments spaced minutes apart saying pretty much the same thing is just a bit much... There _are_ valid cases (although extremely rare) where using a regex will be a better fit than an HTML parser. – Mels Jul 30 '13 at 14:11

3 Answers

7

Use the HTML Agility Pack to parse the HTML:

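using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);

// SelectNodes can return null when nothing matches, so fall back to an empty sequence.
var spans = doc.DocumentNode.SelectNodes("//span") ?? Enumerable.Empty<HtmlNode>();
var boldSpans = from s in spans
                // GetAttributeValue avoids a NullReferenceException on spans without a style attribute.
                let style = s.GetAttributeValue("style", "")
                where style.Contains("font-weight:bold")
                select s;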

Or, even better, use an XPath expression that selects all matching nodes in one line:

doc.DocumentNode.SelectNodes("//span[contains(@style, 'font-weight:bold')]")
carla
Sergey Berezovskiy
1

Don't use ^ since the line doesn't start with the span you want to match.

<span style=["'][^'"]*font-weight:bold[^'"]*['"]>[^<]*</span>

Or as an escaped string:

"<span style=[\"'][^'\"]*font-weight:bold[^'\"]*['\"]>[^<]*</span>"

This matches strings that start with <span style= followed by a single or double quote (' or "). Then [^'"]* allows any characters except a closing quote.

Next comes the literal font-weight:bold, followed again by any number of characters except quotes, then the actual closing quote and the end of the opening tag: [^'"]*['"]>.

(Note that you may or may not want to allow other attributes before and after the style attribute. If you do, you need to alter the regex.)

The span may then contain any number of characters except the tag-opening <, and the match has to end with the closing </span> tag.
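
The explanation above can be checked against the sample from the question. The following is only a minimal sketch, assuming C# and System.Text.RegularExpressions; the class name BoldSpanDemo and the helper FindBoldSpans are made up for illustration, and the pattern is the one from this answer written as a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoldSpanDemo
{
    // Sample input from the question (a single line, no newlines).
    public const string Sample =
        "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
        "<p style=\"margin:0pt\">" +
        "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

    // The pattern from this answer; inside a verbatim string, "" is an escaped quote.
    public const string Pattern =
        @"<span style=[""'][^'""]*font-weight:bold[^'""]*['""]>[^<]*</span>";

    // Hypothetical helper: returns the text of every match in the input.
    public static string[] FindBoldSpans(string html)
    {
        MatchCollection matches = Regex.Matches(html, Pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Prints each matched span on its own line.
        foreach (string span in FindBoldSpans(Sample))
        {
            Console.WriteLine(span);
        }
    }
}
```

On the sample, the first span is skipped because its style value ends before font-weight:bold appears, so only the bold span is reported.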

Robert Fricke
0

Remove the ^, because it anchors the match to the beginning of the line, so the match always has to start at the first span. On top of that, .* matches any characters at all, so it happily runs across tag boundaries.

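Note that dropping the ^ alone will most likely still give you the output you have now: the match still starts at the first <span, and .*? expands across the tags in between until it reaches font-weight:bold. Because matches don't overlap, there won't be a second match containing just the bold span either. To get what you're after, keep the wildcards inside the opening tag, for example by using [^>]*? instead of .*? there.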

Furthermore, tools such as RegexBuddy are good for testing regexes.
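
If no such tool is at hand, a few lines of code can show what removing ^ does on the sample from the question. This is a sketch under assumptions, not a definitive fix: it assumes C# with System.Text.RegularExpressions, the class name UnanchoredDemo and the helper MatchValues are made up, and the TagBounded pattern is a hypothetical variant (not from the question) that uses [^>]*? so the wildcards cannot leave the opening tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnanchoredDemo
{
    // Sample input from the question (a single line, no newlines).
    public const string Sample =
        "<span style=\"font-family:Arial; font-size:10pt\"> r.</span></p>" +
        "<p style=\"margin:0pt\">" +
        "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold\">&#xa0;</span>";

    // The question's pattern with only the leading ^ removed.
    public const string Unanchored =
        @"(<span style=.*?font-weight:bold.*?>.*?</span>)";

    // Hypothetical variant: [^>]*? keeps the wildcards inside the opening tag.
    public const string TagBounded =
        @"<span style=[^>]*?font-weight:bold[^>]*?>.*?</span>";

    // Hypothetical helper: returns the text of every (non-overlapping) match.
    public static string[] MatchValues(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        var values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        foreach (string pattern in new[] { Unanchored, TagBounded })
        {
            string[] values = MatchValues(Sample, pattern);
            Console.WriteLine(pattern + " -> " + values.Length + " match(es)");
            foreach (string value in values)
            {
                Console.WriteLine("  " + value);
            }
        }
    }
}
```

On this sample, the unanchored pattern produces a single match that spans from the first span through the bold one, while the tag-bounded variant produces a single match containing only the bold span.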