-2

I have a pages table of URLs and the categories they're associated with. I'm joining it with ITSELF using a constraints table and using GROUP BY to get unique URLs, then sorting by highest score.

Problem: The highest score of the group of URLs isn't always selected.

(Background: In production, this will be used to know which pages in the 'from' category should hyperlink to pages in the 'to' category)

I think there is something in this answer, but I can't figure out how to adapt it:

Current Query

SELECT keyword, URL, score FROM
    (   
        SELECT keyword, URL, score
        FROM pages
        JOIN constraints
        ON pages.category = constraints.to
        AND constraints.from IN (SELECT category FROM pages WHERE URL = 'https://www.example.net')
        ORDER BY score DESC
    )   
AS x        
GROUP BY URL;

pages

+---------+-------------------------+----------+-------+
| keyword | URL                     | category | score |
+---------+-------------------------+----------+-------+
| Cat     | https://www.example.org | 1        | 100   |
+---------+-------------------------+----------+-------+
| Dog     | https://www.example.com | 2        | 50    |
+---------+-------------------------+----------+-------+
| Fish    | https://www.example.com | 2        | 60    |
+---------+-------------------------+----------+-------+
| Mouse   | https://www.example.net | 3        | 1     |
+---------+-------------------------+----------+-------+

constraints

+------+----+
| from | to |
+------+----+
| 1    | 2  |
+------+----+
| 2    | 1  |
+------+----+
| 3    | 2  |
+------+----+

Current output:

+---------+-------------------------+-------+
| keyword | URL                     | score |
+---------+-------------------------+-------+
| Dog     | https://www.example.com | 50    |
+---------+-------------------------+-------+

Dog row selected, despite having a lower score than the Fish row.

Desired output:

+---------+-------------------------+-------+
| keyword | URL                     | score |
+---------+-------------------------+-------+
| Fish    | https://www.example.com | 60    |
+---------+-------------------------+-------+

Edit: Reduced tables to a minimal reproducible example. Added current output. And explained things a little better.

Jon
  • 452
  • 4
  • 23

1 Answers1

1

Based on your data, I tried working out the problem. My logic is as follows :

  1. I started building a denormalized table, which is table x, which contains information on page traversal, i.e from pages & to pages also with respective URLs.
  2. In table y, using dense_rank() function can help in ranking your results.
  3. From table y, only rank 1 records have to be picked & to remove duplicates group by is used.

my query is as follows :

select to_keyword as Keyword, to_url as URL, to_score as Score 
    from (
     select from_url, to_keyword, to_url, to_score, dense_rank() over (partition by to_url order by to_score desc) as rnk 
       from ( select p.url as from_url, u.Keyword as to_keyword, u.url as to_url, u.Category as to_category, u.Score as to_score 
              from pages p inner join constrains c on p.Category = c.from inner join pages u on u.Category = c.to
             )x
         ) y 
    where rnk = 1 and from_url = 'https://www.b.com/' 
    group by 1, 2, 3
    order by to_url
Anand Vidvat
  • 977
  • 7
  • 20