Regex: How to extract a delimited string that contains some special words?

Question

Given the following HTML:

<p style="line-height:0;text-align:left">
    <font face="Arial">
        <span style="font-size:10pt;line-height:15px;">
            <br />
        </span>
    </font>
</p>
<p style="line-height:0;text-align:left">
    <font face="AR BLANCA">
        <span style="font-size:20pt;line-height:30px;">
            [désignation]
        </span>
    </font>
</p>
<p style="line-height:0;text-align:left">
    &nbsp;&nbsp;
</p>

I want to extract the following part:

<font face="AR BLANCA">
    <span style="font-size:20pt;line-height:30px;">
        [désignation]
    </span>
</font>

I tried this regular expression:

<font.*?font>

This yields two separate matches, but how do I specify that I only want the one that contains []? Thank you.

C#. I don't think I can find another way to solve my problem without regex — hsn_salhi, Sep 07 '15 at 01:15
@Casimir: I prefer to stay with regex because my interaction with the HTML is limited, so I don't think I need to bring in a new API only for this purpose. Thank you anyway — hsn_salhi, Sep 07 '15 at 22:15

score 0 · Answer 1 · answered Sep 07 '15 at 10:18

Here is how to do it with Html Agility Pack:

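using System;
using HtmlAgilityPack;
...

string htmlText = @"<p style=""line-height:0;text-align:left"">
...";

// Parse the markup into a DOM instead of scanning it as raw text.
HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;

// Select every <font> element that has a descendant text node containing '[' followed later by ']'.
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");

// SelectNodes returns null (not an empty collection) when nothing matches.
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // OuterHtml includes the <font> tag itself along with its content.
        Console.WriteLine(node.OuterHtml);
    }
}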

score -2 · Accepted Answer · edited May 23 '17 at 11:44

In general, you shouldn't use regexes for HTML; an HTML parser is usually a much better tool. However, in some isolated cases, a regex works perfectly fine. Assuming this is one of those cases, here's how to do it with a regex.

Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.

We want to match

<font face="AR BLANCA">
    <span style="font-size:20pt;line-height:30px;">
        [désignation]
    </span>
</font>

We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.

<font .*[.*].*</font>

We also have to escape the special characters; otherwise [.*] will be treated as a character class that matches a single . or *, not as literal brackets around any text.

<font .*\[.*\].*</font>
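
To see the difference the escaping makes, here is a minimal C# sketch. The class name EscapeDemo, the helper FirstMatch, and the sample inputs are made up for illustration; only Regex.Match is the real .NET API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: returns the first match of `pattern` in `input`.
    // Match.Value is the empty string when the match fails.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Unescaped: [.*] is a character class matching ONE '.' or '*'.
        Console.WriteLine(FirstMatch("a.b", "[.*]"));          // .
        Console.WriteLine(FirstMatch("[abc]", "[.*]") == "");  // True (no '.' or '*' present)

        // Escaped: \[ and \] are literal brackets, .* is "anything on this line".
        Console.WriteLine(FirstMatch("x [abc] y", @"\[.*\]")); // [abc]
    }
}
```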

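We also want to match across line breaks, but by default a . matches any character except a newline (unless the regex engine's single-line option is enabled). [\S\s] is a character class that, by definition, matches every character: anything that is either non-whitespace or whitespace.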

<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
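
The following C# sketch illustrates the newline behaviour. It assumes a reduced version of the question's markup joined with explicit "\n" characters; the class name DotNewlineDemo and the field Html are illustrative only. RegexOptions.Singleline is the .NET switch that lets . match newlines as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Hypothetical sample: the question's markup, reduced, one tag per line.
    public static readonly string Html = string.Join("\n",
        "<font face=\"Arial\">",
        "    <span><br /></span>",
        "</font>",
        "<font face=\"AR BLANCA\">",
        "    <span>[désignation]</span>",
        "</font>");

    public static void Main()
    {
        // '.' stops at "\n", so <font, [ ] and </font> on separate lines never connect.
        Console.WriteLine(Regex.IsMatch(Html, @"<font .*\[.*\].*</font>"));   // False

        // [\S\s] crosses line breaks, so the pattern now matches...
        Match m = Regex.Match(Html, @"<font [\S\s]*\[[\S\s]*\][\S\s]*</font>");
        Console.WriteLine(m.Success);                                          // True
        // ...but greedily: it runs from the first <font to the last </font>.
        Console.WriteLine(m.Value == Html);                                    // True

        // Alternative in .NET: keep '.' and enable Singleline instead.
        Console.WriteLine(Regex.IsMatch(Html, @"<font .*\[.*\].*</font>",
                                        RegexOptions.Singleline));             // True
    }
}
```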

There is one last problem: this regex will match from the very first <font to the last </font>. With your HTML example, making the quantifiers lazy will not help, because a match is still attempted at the first <font and simply extends until it finds a [, so we need to do something else. The best way I know of is the trick explained here: put a negative lookahead in front of each character so that the match can never run across another font tag. So we replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.

<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
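
Since the question is about C#, here is a rough sketch of how the final pattern could be applied with System.Text.RegularExpressions. The class FontExtractor, its Extract helper, and the Sample string (a trimmed copy of the question's markup) are assumptions made for this example. The lazy variant is included only to show why it is not enough.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontExtractor
{
    // The final pattern from this answer, unchanged.
    public const string Tempered =
        @"<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>";

    // Lazy variant, for comparison only.
    public const string Lazy = @"<font [\S\s]*?\[[\S\s]*?\][\S\s]*?</font>";

    // Hypothetical sample: trimmed copy of the markup from the question.
    public static readonly string Sample = string.Join("\n",
        "<p><font face=\"Arial\">",
        "    <span style=\"font-size:10pt;line-height:15px;\"><br /></span>",
        "</font></p>",
        "<p><font face=\"AR BLANCA\">",
        "    <span style=\"font-size:20pt;line-height:30px;\">[désignation]</span>",
        "</font></p>");

    // Returns the first match, or the empty string when nothing matches.
    public static string Extract(string html, string pattern)
    {
        return Regex.Match(html, pattern).Value;
    }

    public static void Main()
    {
        // The lazy pattern still starts at the FIRST <font and runs on to the brackets.
        Console.WriteLine(Extract(Sample, Lazy)
            .StartsWith("<font face=\"Arial\">", StringComparison.Ordinal));   // True

        // The lookahead version cannot cross a font tag, so it yields only
        // the <font face="AR BLANCA"> ... </font> block.
        Console.WriteLine(Extract(Sample, Tempered));
    }
}
```

If several bracketed blocks are expected, Regex.Matches can be used in place of Regex.Match to iterate over all of them.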

Here's an online demonstration of this regex.

Worked perfectly. Thank you Mr Hat :) — hsn_salhi, Sep 07 '15 at 01:49
