
Given the following HTML:

<p style="line-height:0;text-align:left">
    <font face="Arial">
        <span style="font-size:10pt;line-height:15px;">
            <br />
        </span>
    </font>
</p>
<p style="line-height:0;text-align:left">
    <font face="AR BLANCA">
        <span style="font-size:20pt;line-height:30px;">
            [désignation]
        </span>
    </font>
</p>
<p style="line-height:0;text-align:left">
    &nbsp;&nbsp;
</p>

I want to extract the following part:

<font face="AR BLANCA">
    <span style="font-size:20pt;line-height:30px;">
        [désignation]
    </span>
</font>

I tried this regular expression:

<font.*?font>

This yields two separate matches, but how do I specify that I only want the one that contains square brackets ([])? Thank you.

hsn_salhi

2 Answers


Here is a way to do it with Html Agility Pack:

using HtmlAgilityPack;
...

string htmlText = @"<p style=""line-height:0;text-align:left"">
...";

HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;

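// Select every <font> element that has a descendant text node
// containing a '[' followed, somewhere later, by a ']'.
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");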

if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.OuterHtml);
    }
}
Casimir et Hippolyte

In general, you shouldn't use regexes to parse HTML; there are usually much better ways to do it. In some isolated cases, however, a regex works perfectly well. Assuming this is one of those cases, here's how to do it with a regex.


Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.

We want to match

<font face="AR BLANCA">
    <span style="font-size:20pt;line-height:30px;">
        [désignation]
    </span>
</font>

We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.

<font .*[.*].*</font>

We also have to escape the special characters; otherwise [.*] is parsed as a character class.

<font .*\[.*\].*</font>
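
As a small illustration of the escaping point, here is a minimal C# sketch. The class name and the sample strings are made up for this example; only `Regex.IsMatch` from `System.Text.RegularExpressions` is used.

```csharp
using System;
using System.Text.RegularExpressions;

class BracketEscaping
{
    static void Main()
    {
        // Unescaped, [.*] is a character class: exactly one '.' or '*' character.
        Console.WriteLine(Regex.IsMatch("a.b", @"a[.*]b"));       // True
        Console.WriteLine(Regex.IsMatch("a[xyz]b", @"a[.*]b"));   // False

        // Escaped, \[ and \] are literal brackets and .* is "anything" again.
        Console.WriteLine(Regex.IsMatch("a[xyz]b", @"a\[.*\]b")); // True
    }
}
```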

We also want to match all characters, but by default a . matches every character except a newline. [\S\s] is a character class that by definition matches any character, newlines included.

<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
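
The difference is easy to check in C#. This is only a sketch with a made-up sample string and class name; it also shows `RegexOptions.Singleline`, which in .NET makes `.` match newlines and is an alternative to `[\S\s]`.

```csharp
using System;
using System.Text.RegularExpressions;

class DotVersusAnyChar
{
    static void Main()
    {
        // A bracketed token on its own line, similar to the question's HTML.
        string text = "<span>\n[x]\n</span>";

        // By default '.' stops at '\n', so the pattern cannot cross the line break.
        Console.WriteLine(Regex.IsMatch(text, @"<span>.*\["));                           // False

        // [\S\s] matches any character, including '\n'.
        Console.WriteLine(Regex.IsMatch(text, @"<span>[\S\s]*\["));                      // True

        // RegexOptions.Singleline changes '.' so that it matches '\n' as well.
        Console.WriteLine(Regex.IsMatch(text, @"<span>.*\[", RegexOptions.Singleline));  // True
    }
}
```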

One last problem remains: this regex matches from the very first <font to the last </font>. With your HTML example, making the quantifiers lazy does not help, because the match still starts at the first <font and simply extends until it finds a [. So we need something else. The best way I know of is a negative lookahead that stops each repetition before the next font tag: replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.

<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
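
To see the effect, here is a rough C# sketch that runs a lazy variant and the final regex over a trimmed-down copy of the question's HTML. The class name, the shortened markup, and the lazy pattern are assumptions made for this example; it is not meant as a general HTML extractor.

```csharp
using System;
using System.Text.RegularExpressions;

class ExtractBracketedFont
{
    static void Main()
    {
        // Trimmed version of the question's HTML: two <font> elements,
        // only the second one contains [...].
        string html =
            "<p><font face=\"Arial\"><span><br /></span></font></p>\n" +
            "<p><font face=\"AR BLANCA\"><span>[désignation]</span></font></p>";

        // Lazy quantifiers alone: the match still begins at the first <font.
        string lazy = @"<font [\S\s]*?\[[\S\s]*?\][\S\s]*?</font>";

        // Final regex: each repetition refuses to step onto "<font" or "</font".
        string tempered =
            @"<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>";

        string lazyMatch = Regex.Match(html, lazy).Value;
        Console.WriteLine(lazyMatch.StartsWith("<font face=\"Arial\"", StringComparison.Ordinal)); // True

        // Prints only the second <font> element.
        Console.WriteLine(Regex.Match(html, tempered).Value);
    }
}
```

If several bracketed elements are expected, `Regex.Matches` can be used in place of `Regex.Match` to iterate over all of them.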

Here's an online demonstration of this regex.

The Guy with The Hat