0

I need a function:

f(fqdn,suffix) -> basedomain

with these example inputs and outputs:

f('foobar.quux.somedomain.com','com') -> 'somedomain.com'
f('somedomain.com','com') -> 'somedomain.com'
f('foobar.quux.somedomain.com.br','com.br') -> 'somedomain.com.br'
f('somedomain.com.br','com.br') -> 'somedomain.com.br'

In plain English, if the suffix has n segments, take the last n+1 segments. Find the base domain for the FQDN, allowing for the fact that some FQDNs have more than one suffix element.

The suffixes I need to match are here. I've already got them in my SQL database.

I could write this in C#; it might not be the most elegant but it would work. Unfortunately I would like to have this function in either T-SQL, where it is closest to the data, or in Powershell, which is where the rest of the utility that consumes this data is going to be. I suppose it would be ok to do it in C#, compile to an assembly and then access it from T-SQL, or even from Powershell ... if that would be the fastest executing. If there's some reasonably clever alternative in pure T-SQL or simple Powershell, I'd like that.

EDIT: One thing I forgot to mention explicitly (but which is clear when reviewing the suffix list, at my link above) is that we must pick the longest matching suffix. Both "br" and "com.br" appear in the suffix list (with similar things happening for uk, pt, etc). So the SQL has to use a window function to make sure the longest matching suffix is found.

Here is how far I got when I was doing the SQL. I had gotten lost in all the substring/reverse functions.

SELECT Domain, suffix
FROM (
    SELECT SD.Domain, SL.suffix, 
       RN=ROW_NUMBER() OVER (
           PARTITION BY sd.Domain ORDER BY LEN(SL.suffix) DESC)
    FROM SiteDomains SD
    INNER JOIN suffixlist SL ON SD.Domain LIKE '%.'+SL.suffix
) AS X
WHERE RN=1

This works ok for finding the right suffix. I'm a little concerned about its performance though.

Ross Presser
  • 6,027
  • 1
  • 34
  • 66
  • I assume you would be using .NET string manipulation in C# to accomplish the task right? if so you can just do the exact same things in powershell (the string handling is the same and all the methods available in C# are also available in Powershell) – Mike Garuccio Dec 19 '16 at 18:52
  • `$fqdn -replace ".*?(?=[^.]+\.$suffix)"` – TessellatingHeckler Dec 19 '16 at 19:36
  • @TessellatingHeckler **wow**. That's a mighty brief regex solution. Unfortunately it is not proof against something like foo.bar.com.br.something.com.br or even something like my.computers.com – Ross Presser Dec 19 '16 at 19:42
  • Instead of relying on regex and string manipulation I think there are better methods for this. http://stackoverflow.com/questions/10735190/parsing-string-for-domain-hostname This should translate to PowerShell easily as well with `[system.uri]` – Matt Dec 19 '16 at 19:59
  • Also related: http://stackoverflow.com/questions/14211973/get-host-domain-from-url – Matt Dec 19 '16 at 20:05
  • Darn, I should have anchored it more; `$fqdn -replace "^.*?(?=[^.]+\.$suffix\`$)"` is proof against foo.bar.com.br.something.com.br ... – TessellatingHeckler Dec 19 '16 at 20:40
  • @Matt: The accepted answer on the question you linked does NOT do anything whatsoever to get the base domain. – Ross Presser Dec 20 '16 at 21:16
  • @RossPresser My mistake I thought the URI class would help with the parsing. I figured there had to be something more than string parsing for this – Matt Dec 20 '16 at 23:22

3 Answers3

2

The following demonstrates matching FQDNs with TLDs and extracting the desired n + 1 domain name segments:

-- Sample data.
declare @SampleTLDs as Table ( TLD VarChar(64) );
insert into @SampleTLDs ( TLD ) values
  ( 'com' ), ( 'somedomain.com' ), ( 'com.br' );
declare @SampleFQDNs as Table ( FQDN VarChar(64) );
insert into @SampleFQDNs ( FQDN ) values
  ( 'foobar.quux.somedomain.com' ), ( 'somedomain.com' ),
  ( 'foobar.quux.somedomain.com.br' ), ( 'somedomain.com.br' );
select * from @SampleTLDs;
select * from @SampleFQDNs;

-- Fiddle about.
select FQDN, TLD,
  case
    when DotPosition = 0 then FQDN
    else Reverse( Left( ReversedPrefix, DotPosition - 1) ) + '.' + TLD
    end as Result
  from (
    select FQDNs.FQDN, TLDs.TLD,
      Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) as ReversedPrefix,
      CharIndex( '.', Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) ) as DotPosition
      from @SampleFQDNs as FQDNs inner join
        @SampleTLDs as TLDs on FQDNs.FQDN like '%.' + TLDs.TLD or FQDNs.FQDN = TLDs.TLD  ) as Edna;

-- To select only the longest matching TLD for each FQDN:
with
  ExtendedFQDNs as (
    select FQDNs.FQDN, TLDs.TLD, Row_Number() over ( partition by FQDN order by Len( TLDs.TLD ) desc ) as TLDLenRank,
      Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) as ReversedPrefix,
      CharIndex( '.', Substring( Reverse( FQDNs.FQDN ), Len( TLDs.TLD ) + 2, 100 ) ) as DotPosition
      from @SampleFQDNs as FQDNs inner join
        @SampleTLDs as TLDs on FQDNs.FQDN like '%.' + TLDs.TLD or FQDNs.FQDN = TLDs.TLD )
  select FQDN, TLD,
    case
      when DotPosition = 0 then FQDN
      else Reverse( Left( ReversedPrefix, DotPosition - 1) ) + '.' + TLD
      end as Result
    from ExtendedFQDNs
    where TLDLenRank = 1;
HABO
  • 15,314
  • 5
  • 39
  • 57
  • This is good, but needs a refinement that I didn't mention explicitly in the question (but it is implicit if you look at the suffix list link). Both "br" and "com.br" appear in the suffix list; the longest suffix must be preferred. I'll add this to my question. – Ross Presser Dec 20 '16 at 18:10
  • This works extremely well when paired with the [left self join](http://stackoverflow.com/a/8749095/864696) technique for finding the longest matching suffix. If you update your answer for that I'll accept it. – Ross Presser Dec 20 '16 at 18:32
  • 1
    @RossPresser The answer has been updated: (1) corrected the sample date, (b) corrected handling of FQDN = TLD case, (iii) added example that picks the "best", i.e. longest TLD, match for each FQDN. – HABO Dec 20 '16 at 20:15
1

Here's how I would do it in C#:

string getBaseDomain(string fqdn, string suffix)
{
    string[] domainSegs = fqdn.Split('.');
    return domainSegs[domainSegs.Length - suffix.Split('.').Length - 1] + "." + suffix;
}

So here it is in Powershell:

function getBaseDomain
{
  Param(
    [string]$fqdn, 
    [string]$suffix 
  )
  $domainSegs = $fqdn.Split(".");
  return $domainSegs[$domainSegs.Length - $suffix.Split(".").Length - 1] + "."+$suffix;
}

Seems rather silly now to have wasted stackoverflow.com's time with this. My apologies.

Ross Presser
  • 6,027
  • 1
  • 34
  • 66
1

Here is a tsql variant...

declare @fqdn varchar(256) = 'somedomain.com'
declare @suffix varchar(128) = 'com'

select left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2)

if(select CHARINDEX('.',reverse(left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2)))) = 0
    begin
    select left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2) + '.' + @suffix
    end
else
    begin
    select right(left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2),CHARINDEX('.',reverse(left(@fqdn,CHARINDEX(@suffix,@fqdn) - 2))) - 1) + '.' + @suffix
    end
S3S
  • 24,809
  • 5
  • 26
  • 45