2

I am trying to fetch a hierarchical data in my database. Here is my initial code.

        SELECT TOP 5
            year, location_state, location_city,
            COUNT(tf.customer_key) Number_of_Customers
        FROM TransactionFact tf
        JOIN LocationDim as ld
            ON ld.location_key = tf.seller_location_key
        JOIN DateDim dd
            ON dd.date_key = tf.order_date_key
        WHERE dd.year = 2016 and location_state = 'SP'
        GROUP BY dd.year, ld.location_state, ld.location_city
        ORDER BY dd.year DESC, Number_of_Customers DESC

And here is the result, result.

Basically, in the query, what I want to do is to not hard code the location_state in the WHERE clause. I want to make it dynamic so that what I get are the top 5 cities in each state.

Here are the column names for the LocationDim table

location_key
location_zip_code_prefix
location_state
location_city

EDITED: What I need is something like this.

+------+----------------+---------------+---------------------+
| year | location_state | location_city | Number_of_Customers |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_1     |    city_1     |       100           |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_1     |    city_2     |       90            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_1     |    city_3     |       89            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_1     |    city_4     |       88            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_1     |    city_5     |       20            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_2     |    city_1     |       100           |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_2     |    city_2     |       45            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_2     |    city_3     |       23            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_2     |    city_4     |       10            |
+------+----------------+---------------+---------------------+
| 2016 |    STATE_2     |    city_5     |       5             |
+------+----------------+---------------+---------------------+



PS: Sorry, this is my first question in stackoverflow. If this question is duplicated, pls drop the link and I'll give it a go. Thank u in advance.

Seven
  • 330
  • 2
  • 15
  • Does this answer your question? [Get top 1 row of each group](https://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group) – Charlieface Dec 10 '21 at 09:26
  • @Charlieface, I think I'll have a hard time reading that one because I don't know how CTE works. I still need to figure that out. I am hoping there's an easier way. Thank you – Seven Dec 10 '21 at 09:45
  • The standard answer is pretty much identical to @LukStorms answer: just add a partitioned row-number and filter by that. You place your whole query into a derived table with the row-number then filter outside of that (because you can't filter `ROW_NUMBER` from the same level). Side point: `COUNT(tf.customer_key)` is the same as `COUNT(*)`, it just counts non-null rows. If you wanted the count of distinct `customer_key` you would need `COUNT(DISTINCT tf.customer_key)` – Charlieface Dec 10 '21 at 09:48

4 Answers4

1

You can add rownumbers using ROW_NUMBER() OVER (PARTITION BY ORDER BY )

The query below partition records by location_state and add rownumber with number_of_customer order :

Select * from 
inn.*,  ROW_NUMBER() OVER (PARTITION BY location_state ORDER BY Number_of_Customers DESC) AS rn
( 
SELECT year,
       location_state,
       location_city,
       COUNT(tf.customer_key) Number_of_Customers
  FROM TransactionFact tf
  JOIN LocationDim as ld
    ON ld.location_key = tf.seller_location_key
  JOIN DateDim dd
    ON dd.date_key = tf.order_date_key
 WHERE dd.year = 2016
 GROUP BY dd.year, ld.location_state, ld.location_city
 ) inn

After this , you can easily filter this and select top5 or any other order ...

Note: I used your query as inner query. I didnt have a chance to test it since there is no fiddle

Ali Fidanli
  • 1,342
  • 8
  • 12
1

If you calculate a row_number then you can filter on that.

SELECT *
FROM 
(
    SELECT 
      dd.[year]
    , ld.location_state
    , ld.location_city
    , COUNT(tf.customer_key) AS total_customers
    , rn = ROW_NUMBER() OVER (PARTITION BY ld.location_state, dd.[year] 
                              ORDER BY COUNT(tf.customer_key) DESC)
    FROM TransactionFact AS tf
    JOIN LocationDim AS ld
      ON ld.location_key = tf.seller_location_key
    JOIN DateDim AS dd
      ON dd.date_key = tf.order_date_key
    WHERE dd.[year] = 2016
    GROUP BY ld.location_state, ld.location_city, dd.[year] 
) q
WHERE rn <= 5
ORDER BY location_state, [year], rn
LukStorms
  • 28,916
  • 5
  • 31
  • 45
  • @LukeStorms Can you explain how rn works there? Because this indeed solved my problem. – Seven Dec 10 '21 at 09:46
  • It generates a sequencial number without gaps per PARTITIONend columns. Also popular is DENSE_RANK, which would assign equal numbers to same values in the order by. F. e. 2 with same total_customers would get same dense_rank, but different row_number. – LukStorms Dec 10 '21 at 09:51
  • It works basically because [window functions](https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql) like that are processed after group by. – LukStorms Dec 10 '21 at 10:10
0

you can try to pass the value of each location_state to your query using the cross apply operator as follows:

;With STATES As (
SELECT location_state 
FROM LocationDim 
GROUP BY location_state)
SELECT T.[year], T.location_state, T.location_city, T.Number_of_Customers
FROM STATES CROSS APPLY (
            SELECT TOP 5
                [year], location_state, location_city,
                COUNT(tf.customer_key) AS Number_of_Customers
            FROM TransactionFact tf
            JOIN LocationDim as ld
                ON ld.location_key = tf.seller_location_key
            JOIN DateDim dd
                ON dd.date_key = tf.order_date_key
            WHERE dd.[year] = 2016 and location_state = STATES.location_state
            GROUP BY dd.[year], ld.location_state, ld.location_city
            ORDER BY dd.[year] DESC, Number_of_Customers DESC) As T
Anton Grig
  • 1,640
  • 7
  • 11
0

So, basically what's happening here:

    SELECT TOP 5
        dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers
    FROM TransactionFact tf
    JOIN LocationDim as ld
        ON ld.location_key = tf.seller_location_key
    JOIN DateDim dd
        ON dd.date_key = tf.order_date_key
    WHERE dd.year = 2016 and location_state = 'SP'
    GROUP BY dd.year, ld.location_state, ld.location_city
    ORDER BY dd.year DESC, Number_of_Customers DESC;

Is that you are selecting exclusively the top 5 results from that query.

But what you want is to get all results, ranked by year and state, and take only the top 5 from each state?

I'd use the RANK() function which is designed pretty much specifically for the scenario you're looking at. I'll show it as an added column on your query:

  SELECT * FROM ( SELECT dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers,
           RANK() OVER(PARTITION BY dd.[year], ld.location_state, ld.location_city 
                       ORDER BY Number_of_Customers DESC) r
    FROM TransactionFact tf
    JOIN LocationDim as ld
        ON ld.location_key = tf.seller_location_key
    JOIN DateDim dd
        ON dd.date_key = tf.order_date_key
    WHERE dd.year = 2016 and location_state = 'SP' 
    GROUP BY dd.year, ld.location_state, ld.location_city
     ) x WHERE x.r <= 5
     ORDER BY x.[year] desc, x.location_state, x.r

Alternatively, you could use a CTE (Common Table Expression) to hold the results from your first query, before applying the RANK:

;WITH cte AS(
   SELECT dd.[year], ld.location_state, ld.location_city, COUNT(tf.customer_key) Number_of_Customers, 
       RANK() OVER(PARTITION BY [year], location_state, location_city
                   ORDER BY Number_Of_Customers DESC) r
   FROM TransactionFact tf
        JOIN LocationDim as ld
            ON ld.location_key = tf.seller_location_key
        JOIN DateDim dd
            ON dd.date_key = tf.order_date_key
   WHERE dd.year = 2016 and location_state = 'SP'
   GROUP BY dd.year, ld.location_state, ld.location_city
)
SELECT *
FROM cte
WHERE r <= 5;

As a disclaimer, I'm only placing a semicolon before the WITH to indicate that if there is a statement beforehand that doesn't end with a semicolon, then this statement will throw an error.

EDIT: To add, using RANK means that the results get ranked by value. So if two cities have 30,000 customers, they will both get the same value for the RANK (similar to what they do in leaderboards when people are tied in a round of golf). Meaning you would get a minimum of 5 results from each state - if you want exactly 5, regardless of tied values, then you can use ROW_NUMBER in the same way.

Andrew Corrigan
  • 1,017
  • 6
  • 23
  • It throws an error "Invalid column name 'r'". I'm guessing it doesn't know the 'r' inside the ```where``` even though r was defined after the RANK() line – Seven Dec 10 '21 at 09:44
  • Just edited to put the RANK in the inner query and the WHERE in the outer. – Andrew Corrigan Dec 10 '21 at 09:51