2

I am attempting to do statistical tracking. In my database I am storing referring urls. Frequently I have url's that resemble the following:

http://www2.trafficadbar.com/__a4w4
http://trafficadbar.com/__a4w4
http://www.trafficadbar.com/__a4w4
http://4acesmailer.com/credit_click.php?userid=2472&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=2714&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=2723&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=3245&openkey=gbyp2vcm
http://4acesmailer.com/credit_click.php?userid=3259&openkey=gbyp2vcm

I want to know how I would do a GROUP BY and COUNT on a regex pattern. Basically what I want is as follows returned:

trafficadbar 3
4acesmailer 5 

Currently when I try to do a GROUP BY it only works where the url's are exactly the same. so www.blah.com and blah.com are two different results and further each url variable ?blah=1&blahblah=2 acts as yet anoher unique group,

I have searched for countless solutions, but they mostly seem to be very specific to the problem asked, and almost all seem to show some "non-regex" workaround - which would be fine... if I could find a method I could apply.

Your Common Sense
  • 156,878
  • 40
  • 214
  • 345
Bruce
  • 1,039
  • 1
  • 9
  • 31

5 Answers5

1

To retrieve the part immediately preceding the top level domain from the hostnames, you could work along:

SELECT
  REVERSE(SUBSTRING(SUBSTRING_INDEX(rev_hostname, '.', 2),
          LOCATE('.', rev_hostname) + 1)
         ) domain
  , COUNT(id) hits
FROM (
  SELECT
    id
    , CONCAT(REVERSE(SUBSTRING_INDEX(SUBSTRING(referring_site, 8),
                                     '/', 1)), '.') rev_hostname
  FROM TestData
  ) T
GROUP BY domain
;

It:

  • relies on all referring_sites to start off with http://, and
  • will fail - as it stands - for, e.g., 4acesmailer.co.uk.

Either one could be addressed (to some degree) if required.

See it in action SQL Fiddle (with your data somewhat adjusted/extended to cover some more cases).

Please comment if and as this requires adjustment / further detail.

Abecee
  • 2,365
  • 2
  • 12
  • 20
0

If you just care about those 2 values, something like this would work:

select case when yourcolumn like '%trafficadbar%' then 'trafficadbar' 
            when yourcolumn like '%4acesmailer%' then '4acesmailer' 
       end,
    count(*)
from yourtable 
group by 1

Edit, given your comments, this could be a little more dynamic and relatively easy to extend:

select 
  replace(replace(replace(
     left(yourcolumn, locate('.com', yourcolumn) - 1), 
     'http://', ''), 
     'www.', ''), 
     'www2.', ''),
  count(*)
from yourtable 
group by 1
sgeddes
  • 62,311
  • 6
  • 61
  • 83
  • 1
    I'm pretty sure those are examples and they don't know all of the domain names up front. – AbraCadaver Jul 23 '16 at 00:33
  • @AbraCadaver -- Good point, could be, not completely clear if the OP is searching for specific domains or wants to somehow aggregate similar ones. Don't think the later is possible with sql alone... – sgeddes Jul 23 '16 at 00:37
  • That word "resembling" in my post, definitely was intended to say I am not specifically searching for those - that would be easy :) – Bruce Jul 23 '16 at 01:00
  • @Bruce -- fair enough, this isn't trivial with `sql` alone. See edits though for a potential "working" solution... – sgeddes Jul 23 '16 at 01:18
  • @sgeddes this [answer](http://stackoverflow.com/a/20046959) perhaps? idk – Drew Jul 23 '16 at 02:22
0

I'm not skilled enough to do this reliably all in SQL; there are too many moving parts: lots of subdomains, lots of TLDs possible (not just .com), malformed domains possible etc...

My approach: Select everything and parse in PHP.

In the example below, I assume the URLs are in a urls column, and that you have a date_added column with the datetime when each url was added to the DB. Adjust your query accordingly.

Select all urls added within the last 30 days:

SELECT `urls` FROM `myTable`
WHERE `date_added` >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)

Place all results in a $rows array, then process that to generate the report you want

$rows = [...];//Associative array of all rows returned by the query above
$results = []; //will hold aggregate counts

foreach($rows as $row){   
    $host = parse_url($row['urls'],PHP_URL_HOST); //eg: www2.trafficadbar.com
    $matches = [];

    //find top level domain or skip to next row
    if(!preg_match('/[^\.]*\.[^\.]+$/',$host,$matches)) continue;        

    $domain = $matches[0]; //eg: trafficadbar.com

    //increment the count for this domain in results
    if(!isset($results[$domain])) $results[$domain]=0;
    $results[$domain]++;
}

With the input you gave in the OP, the $results will be:

[
    'trafficadbar.com' => 3,
    '4acesmailer.com' => 5,
]

You'll notice that unlike you, I kept the TLD (eg: .com, .net...) because ebay.com and ebay.ph are completely different domains. I would advise against mashing them into one result.

Live demo

BeetleJuice
  • 39,516
  • 19
  • 105
  • 165
  • Good one. But just add to array in loop and then use `array_count values()` – AbraCadaver Jul 23 '16 at 02:54
  • I don't think that would be faster because you will be traversing arrays twice (first all urls to add domains, then all results to count values) when I only do it once. – BeetleJuice Jul 23 '16 at 02:59
0

Although the solution from @BeetleJuice would have worked, and likely more reliably than the solution I chose, I opted for an SQL solution

SELECT 
   CASE WHEN SUBSTRING(referring_site, 1, 8) = 'http://w' 
      THEN SUBSTRING_INDEX((SUBSTRING_INDEX(referring_site, '.', 2)), '.', -1)
      ELSE SUBSTRING_INDEX((SUBSTRING_INDEX(referring_site, '.', 1)), '://', -1) 
   END 
AS domain 
FROM 
....

Drawbacks are when it doesn't star with a http://w but rather some http://random.sub.domain

Bruce
  • 1,039
  • 1
  • 9
  • 31
  • If anyone has a better pure mysql solution, that is what I would really like. Ultimately what I want to do is is count the number of "." between :// and / then if 2 do the first substring_index else do the second. I just don't know how to get and apply a proper count on the "." – Bruce Jul 23 '16 at 17:42
  • (i) Is it correct to re-phrase: You want the part preceding the top level domain (**com**, **org**, etc.)? (Judging by the number of dots might be misleading: What would you like to get from **abc.def.ghi.com**?) (ii) Always starting **http**? – Abecee Jul 26 '16 at 19:34
0

Similar to: https://stackoverflow.com/a/72834976/7768504

MySQL introduced REGEXP_SUBSTR to group columns by applying regular expressions. Documentation for REGEXP_SUBSTR

REGEXP_SUBSTR(<column_name>, <regular_expression>, <starting_position>, <match_occurrence>)

Yashashvi
  • 340
  • 3
  • 12