0

I'm currently trying to extract latest visited domain information from a table that includes a visits history on my websites and information I want to retreive.

Here is my table structure:

CREATE TABLE `visit_record` (
    `id` INT NOT NULL AUTO_INCREMENT,
    `date` VARCHAR( 19 ) NOT NULL,
    `url` VARCHAR( 1024 ) NOT NULL,
    ... EXTRA COLUMNS/DATA ...
    PRIMARY KEY ( `id` )
);

Here URL stands for multiple pages on same or different domains.

But I want to get all columns for the latest row for each distinct domain, and my URL store a full query, not a domain only (for more flexibility regarding other features).

The problem is, my distinct value to group on is a calculated value. I tried mixing the solution from the post shown in examples I tried, with grouping / left joins / different structure of queries / subqueries, but I cannot manage and as soon as I want to add my other values/columns (for e.g ID of the row, extra data), it show 99% of the time the following error:

SELECT list contains nonaggregated column

(even if I found basic queries on internet looking like my-ones).

As I want to group by domain I'm first extracting a domain name as a column, in a query I'll use later with an alias:

SELECT
    `visit_record`.`id`,
    `visit_record`.`date`,
    SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
FROM
    `visit_record`;

Here everything is ok, and I see an additional column having my domain name without protocol and query details.

Now I would like to take the latest (MAX) row: the row having the latest date, and get only 1x per unique domain.

I already found part of the solution I guess, thanks to some topics such as the-ones below: Select info from table where row has max date SQL query to get the latest row

But I seems tricky / requiring modifications to fit my use case.

I'm not providing more examples because I made 15+ tries and I don't know which-one is the best to start from, but I guess you have enough information to help me on this!

Data sample of visit_record table:

id      date                    url                                 extra columns
_________________________________________________________________________________
1       2023-06-21 00:00:00     https://example1.com/en             ...
2       2023-06-20 23:00:00     https://example1.com/uk             ...
3       2023-06-20 14:00:00     https://example2.com/en/about       ...
4       2023-06-21 03:00:00     https://example2.com/fr             ...

My expected output:

id      date                    url                                 domain          extra columns
_________________________________________________________________________________________________
1       2023-06-21 00:00:00     https://example1.com/en             example1.com    ...
4       2023-06-21 03:00:00     https://example2.com/fr             example2.com    ...

Thank you in advance for your help.

Problem Reformulation

The main problem was performing this domain extraction from URL and grouping on it, as it is not a primary key, by keeping all columns of the row which is the "latest" (by date) one of each domain.

SOLUTION

Get for each domain which is not a primary key (extracted from an URL value), the latest-row by date (which is not a primary key too), including all the related row columns.

This query will add a last column "domain" and return the result:

SELECT
    t1.*
FROM
    (
        SELECT
            *,
            SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
        FROM 
            `visit_record`
   ) t1
INNER JOIN
    (
        SELECT
            domain,
            max( date ) lastdate
        FROM (
                SELECT
                    *,
                    SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
                FROM
                    `visit_record`
            ) t0
        GROUP BY domain
    ) t2 ON
        t1.domain = t2.domain
        AND t1.date = t2.lastdate;
Lucas
  • 13
  • 2
  • What values are in date, and how do you determine which is latest? better to use date or datetime types, not strings. – ysth Jun 21 '23 at 12:03
  • Please provide sample data and expected output. Also, define what does the "latest" means? Your `date` column is defined as `VARCHAR( 19 )`. – slaakso Jun 21 '23 at 12:04
  • Date is a "Y-m-d H:i:s" formatted date (stored as string) @ysth, and I cannot give more details about latest, I mean latest is latest: the most recent date registered. – Lucas Jun 21 '23 at 12:08
  • I added data sample and expected output, sorry for late @slaakso – Lucas Jun 21 '23 at 12:17
  • General discussion of efficient "groupwise-max" techniques: https://mysql.rjweb.org/doc.php/groupwise_max – Rick James Jun 22 '23 at 16:12

3 Answers3

1

Try this

select * 
from(
  select *,row_number()over(partition by domain order by date desc) rn
  from (
    SELECT *,
       SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( 
           SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 )
             , '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
    FROM  `visit_record`
   )t1
  )t2
where rn=1;

Result

id date url EXTRA_COLUMNS domain rn
1 2023-06-21 00:00:00 https://example1.com/en ... example1.com 1
4 2023-06-21 03:00:00 https://example2.com/fr ... example2.com 1

Updated:
if row_number() not available in your version of MySQL, use this example:

select * 
from (
    SELECT *,
       SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
    FROM  `visit_record`
   )t1 
inner join (
  select domain,max(date) lastdate
  from (
    SELECT *,
       SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( SUBSTRING_INDEX( `visit_record`.`url`, '/', 3 ), '://', -1 ), '/', 1 ), '?', 1 ) AS `domain`
    FROM  `visit_record`
   ) t0
  group by domain
  )t2 on t1.domain=t2.domain and t1.date=t2.lastdate
;

Example here

ValNik
  • 1,075
  • 1
  • 2
  • 6
  • Hey, just tested, but unfortunatelly, I get the following error: `#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '(partition by domain order by date desc) rn from ( SELECT *, SUB' at line 3` – Lucas Jun 21 '23 at 13:36
  • @Lucas: which version of MySQL are you running (`select version()` will tell you)? Window functions are available in MySQL 8.0 only. – GMB Jun 21 '23 at 15:54
  • Answer updated for earlier versions of MySql – ValNik Jun 21 '23 at 20:46
  • Amazing @ValNik ! this is working properly on my side, I just see 2x times the "domain" column, but I guess I can manage from here – Lucas Jun 22 '23 at 11:16
  • Thank you as well for letting the original / recent solution available, so I can use it in case of a migration or upgrade. – Lucas Jun 22 '23 at 11:17
1

(A long Comment on Lajos's Answer.)

select vr1.id, vr2.`date`, vr1.URL
from visit_record vr1
join (
    select url, max(rec.`date`) as `date`
    from visit_record rec
    GROUP BY url
) vr2  ON vr2.url = vr1.url
      AND vr2.date = vr1.date

This should get a row for each URL. But if there are dup dates, it can deliver multiple rows for a given URL.

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

You can join the actual table with an aliased table that finds the maximum date, order by id descendingly and take the first row only.

select vr1.id, vr2.`date`, vr1.URL
from visit_record vr1
join (
    select max(rec.`date`) as `date`
    from visit_record rec
) vr2
on 1=1
order by id desc
limit 0, 1

This should work properly with your format of Y-m-d H:i:s, although, I strongly recommend to use actual date columns rather than varchars that are logically dates.

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175
  • Hey, thank you so much! This actually works, but the thing is this works for only 1x domain no? How can I get this result, but for each distinct domain of the list? – Lucas Jun 21 '23 at 13:40
  • I guess I can left join a table of distinct extracted domains an get this result for all of them, but there's maybe an easier / more efficient way. – Lucas Jun 21 '23 at 13:43
  • @Lucas you can group by domains. Before I edit my answer to solve this problem, can you tell me exactly what your distinct domains are? If you can tell me that, then I can help you further. Is each distinct `URL` value a separate domain? – Lajos Arpad Jun 21 '23 at 14:02
  • Hey, thanks for the offer of help, unfortunately there will be multiple domain I want to add later (I did not create yet) and from 0 to multiple pages, so the better option should be to group by domain without targeting a specific one. – Lucas Jun 22 '23 at 11:12
  • @Lucas there's nothing unfortunate in that. It is a resolvable problem. All that I need in order to help you with that is to know the exact definition of what you call a domain. If you could edit your question with a query that shows **exactly** how you would get the distinct domains, then the problem would be easy to solve for me. – Lajos Arpad Jun 22 '23 at 12:25
  • See my answer for fixing the 1x "bug". – Rick James Jun 22 '23 at 16:10
  • @RickJames thanks, that makes a lot of sense. I will not edit my answer to avoid plagiarism :) But I did upvote your answer. – Lajos Arpad Jun 22 '23 at 18:39