Need to accurately write RegEx query

Question

I have a piece of html code that I extracted:

Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><

From this string, I am trying to extract the fqdn (hostname0403.domain.tld). I figured I would use the following logic:

begins with >, ends with <
must include at least 1 period (dot).
must include either all numbers, all letters, or a combination of both.

What I am hoping to end up with is ">hostname0403.domain.tld<" and from there I can strip off the ><. This is the reg that I have so far, which works, but I don't think it is accurate:

$reg = ">[\w\.]+<"

I am very new to regex, and while this does work, I'm not sure whether it is fail-safe. Any help would be appreciated.

Just for the record: parsing HTML with regex is not recommended. But to help you, it is important to know which parts of that HTML change and which parts don't. — DasKrümelmonster, Jan 17 '13 at 16:10
To echo what @DasKrümelmonster said: See [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) as to why you shouldn't use regex to parse HTML in general. — Bobson, Jan 17 '13 at 16:13
Maybe overkill for your purpose, but I would have a go at http://htmlagilitypack.codeplex.com/ — Jester, Jan 17 '13 at 16:17

score 1 · Answer 1 · edited May 23 '17 at 12:27

The regex pattern needs some work. For instance, there could be whitespace before and/or after the hostname. And a hostname can contain '-' characters. You can handle the whitespace like so:

'>\s*(..hostname regex)\s*<'

For a better hostname regex, see this SO answer. Here's how you would modify that regex to suit your needs:

$str = 'Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><'
$ValidHostnameRegex = ">\s*((?:(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]))\s*<"
$str -match $ValidHostnameRegex
$matches[1]

Outputs (the first line is the Boolean result of -match):

True
hostname0403.domain.tld
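
The question also asks that the name contain at least one dot, while the pattern above accepts a single label such as `localhost`. A minimal sketch, assuming that requirement should be enforced: the dotted-label group is repeated with `+` instead of `*`. The names `$label`, `$FqdnRegex`, `$withDot`, and `$noDot` are illustrative and not from the original answer, and `$label` is a compact rewrite of the same label alternation.

```powershell
# One hostname label: starts and ends with a letter or digit, hyphens allowed inside.
$label = '[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?'

# Assumption: at least one "label." must precede the final label (+ instead of *).
$FqdnRegex = ">\s*((?:${label}\.)+${label})\s*<"

# Hypothetical sample inputs: one dotted name with a hyphen and padding, one without a dot.
$withDot = '<td ><span class="hpPageText" > hostname-0403.domain.tld </span>'
$noDot   = '<td ><span class="hpPageText" >localhost</span>'

if ($withDot -match $FqdnRegex) {
    $dotted = $matches[1]   # capture group 1 holds the name without > and <
}
$noDotMatched = $noDot -match $FqdnRegex   # no dot, so this is not a match
```

Whether single-label names should be rejected depends on the pages being scraped; if short hostnames can appear there, the original `*` is the safer choice.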

score 1 · Answer 2 · answered Jan 17 '13 at 16:25

You can use the following (as a bonus, the Regex excludes the > and < for you):

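        // Requires: using System.Text.RegularExpressions;
        string source = @"Server Address</span></td><td    ><span  class=""hpPageText"" >hostname0403.domain.tld</span></td><";
        // The lookbehind (?<=\>) and lookahead (?=\<) require > and < around the name without including them in the match.
        Regex r = new Regex(@"(?<=\>)(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)");

        string fqdn = "";
        Match fqdnMatch = r.Match(source);
        if (fqdnMatch.Success)
        {
            // Value is the whole match, i.e. the FQDN only.
            fqdn = fqdnMatch.Value;
        }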
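
Since the question uses PowerShell, the same lookaround pattern can probably be used there through the .NET `Regex` class directly. This is a rough port of the C# snippet, not something the answer itself provides; the variable names `$source`, `$pattern`, `$m`, and `$fqdn` are chosen here for illustration.

```powershell
$source = 'Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><'

# Same pattern as the C# answer; single quotes keep PowerShell from touching the backslashes.
$pattern = '(?<=\>)(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)'

# [regex]::Match returns a Match object; Value is the whole match without > and <.
$m = [regex]::Match($source, $pattern)
$fqdn = if ($m.Success) { $m.Value } else { '' }
$fqdn
```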

K Kimble


Thanks. When I use your regex, I get back 4 matches: tld, domain, domain., and the FQDN. I would like it to return only 1: the FQDN. – Progger Jan 17 '13 at 16:32
If you want to eliminate the groups, you could use this instead: "(?<=\>)(?:(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)" It simply hides the groups. However, either way only one "Value" is created. – K Kimble Jan 17 '13 at 18:27
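
The "4 matches" described above are most likely the entries of PowerShell's automatic `$matches` table: key 0 is the whole match and keys 1 to 3 are the capture groups. A small sketch comparing the two patterns from this thread, assuming the `-match` operator is what produced them; `$capturing`, `$nonCapturing`, and the count variables are illustrative names.

```powershell
$source = 'Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><'

# Pattern from the answer: three capturing groups, so $matches gets keys 0..3.
$capturing = '(?<=\>)(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)'
$null = $source -match $capturing
$capturingCount = $matches.Count

# Pattern from the comment: (?:...) groups do not capture, so only key 0 remains.
$nonCapturing = '(?<=\>)(?:(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)'
$null = $source -match $nonCapturing
$nonCapturingCount = $matches.Count
$fqdn = $matches[0]   # the whole match is the FQDN in both cases
```

Either way, `$matches[0]` is the value to read; the non-capturing form only removes the extra entries.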
