I have a piece of html code that I extracted:

Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><

From this string, I am trying to extract the fqdn (hostname0403.domain.tld). I figured I would use the following logic:

  1. begins with >, ends with <
  2. must include at least 1 period (dot).
  3. must include either all numbers, all letters, or a combination of both.

What I am hoping to end up with is ">hostname0403.domain.tld<", and from there I can strip off the > and <. This is the regex that I have so far, which works, but I don't think it is accurate:

$reg = ">[\w\.]+<"

I am very new to regex and, while this does work, I'm not sure it is fail-safe. Any help would be appreciated.

Victor Zakharov
Progger
  • Just for the record: parsing HTML with regex is not recommended. But to help you, it is important to know which parts of that HTML change and which parts don't. – DasKrümelmonster Jan 17 '13 at 16:10
  • To echo what @DasKrümelmonster said: see [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) as to why you shouldn't use regex to parse HTML in general. – Bobson Jan 17 '13 at 16:13
  • Maybe overkill for your purpose, but I would have a go at http://htmlagilitypack.codeplex.com/ – Jester Jan 17 '13 at 16:17

2 Answers

The regex pattern needs some work. For instance, there could be whitespace before and/or after the hostname. And a hostname can contain '-' characters. You can handle the whitespace like so:

'>\s*(..hostname regex)\s*<'

For a better hostname regex, see this SO answer. Here's how you would modify that regex to suit your needs:

$str = 'Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><'
$ValidHostnameRegex = ">\s*((?:(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]))\s*<"
# -match returns a Boolean; on success, capture group 1 holds the hostname without the surrounding > and <
if ($str -match $ValidHostnameRegex) { $matches[1] }

Outputs:

hostname0403.domain.tld
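
Note that the pattern above repeats the "label plus dot" group with `*`, so it also accepts a single label with no dot at all (for example `>localhost<`). The following is a minimal sketch, assuming the value must contain at least one dot as the question requires: it changes that `*` to `+`. The names `$label`, `$fqdnRegex`, and `Get-Fqdn`, and the second and third sample strings, are illustrative and not part of the original answer.

```powershell
# One DNS label: alphanumeric at both ends, hyphens allowed only in the middle.
$label = '[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?'

# One or more "label." prefixes (the + enforces at least one dot), then a final label.
# Optional whitespace is allowed between the angle brackets and the name.
$fqdnRegex = ">\s*((?:$label\.)+$label)\s*<"

# Hypothetical helper: returns the first FQDN found between > and <, or nothing.
function Get-Fqdn([string]$Html) {
    if ($Html -match $fqdnRegex) { $Matches[1] }
}

Get-Fqdn '<span  class="hpPageText" >hostname0403.domain.tld</span>'   # hostname0403.domain.tld
Get-Fqdn '<td> web-01.example.com </td>'                               # web-01.example.com
Get-Fqdn '<td>localhost</td>'                                          # no output: no dot
```

This only tightens the shape of the name; it is still a regex applied to HTML, so it will match any dotted text that happens to sit directly between `>` and `<`.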
Keith Hill

You can use the following (as a bonus, the Regex excludes the > and < for you):

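        // Requires: using System.Text.RegularExpressions;
        string source = @"Server Address</span></td><td    ><span  class=""hpPageText"" >hostname0403.domain.tld</span></td><";
        // The lookbehind (?<=\>) and lookahead (?=\<) assert the angle brackets without including them in the match.
        Regex r = new Regex(@"(?<=\>)(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)");

        string fqdn = "";
        Match fqdnMatch = r.Match(source);
        if (fqdnMatch.Success)
        {
            // Value is the whole match, i.e. the FQDN; the numbered groups hold only parts of it.
            fqdn = fqdnMatch.Value;
        }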
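
For readers working in PowerShell, as in the question, the same lookaround idea can be sketched roughly as follows. This is an adaptation under assumptions, not the answerer's code: it uses `[regex]::Match` from .NET, non-capturing groups throughout so that the match carries only group 0, and the illustrative variable names `$label`, `$pattern`, `$m`, and `$fqdn`.

```powershell
$str = 'Server Address</span></td><td    ><span  class="hpPageText" >hostname0403.domain.tld</span></td><'

# One DNS label: alphanumeric at both ends, hyphens allowed only in the middle.
$label = '[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?'

# (?<=>) requires a preceding '>' and (?=<) a following '<', neither is part of the match.
$pattern = "(?<=>)(?:$label\.)*$label(?=<)"

$m = [regex]::Match($str, $pattern)
if ($m.Success) {
    $fqdn = $m.Value      # the whole match, without > and <
    $fqdn
    $m.Groups.Count       # 1: only group 0, because every group is non-capturing
}
```

Unlike the `\s*` variant, this pattern does not tolerate whitespace between the brackets and the name.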
K Kimble
  • Thanks. When I use your regex, I get back 4 matches: tld, domain, domain., and the FQDN. I would like it to return only 1: the FQDN. – Progger Jan 17 '13 at 16:32
  • If you want to eliminate the groups, you could use this instead: "(?<=\>)(?:(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*(?:[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])(?=\<)" The (?: ... ) syntax makes the groups non-capturing, so they no longer appear in the results. Either way, only one "Value" is created. – K Kimble Jan 17 '13 at 18:27