How to get back aggregate values across 2 dimensions using Python Cubes?

Question

Situation

Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my datatables in pictorial form:

The same in text format:

Store table

------------------------------
| id  | code | address       |
|-----|------|---------------|
| 1   | S1   | Kings Row     |
| 2   | S2   | Queens Street |
| 3   | S3   | Jacks Place   |
| 4   | S4   | Diamonds Alley|
| 5   | S5   | Hearts Road   |
------------------------------

Product table

------------------------------
| id  | code | name          |
|-----|------|---------------|
| 1   | P1   | Saucer 12     |
| 2   | P2   | Plate 15      |
| 3   | P3   | Saucer 13     |
| 4   | P4   | Saucer 14     |
| 5   | P5   | Plate 16      |
|  and many more ....        |
|1000 |P1000 | Bowl 25       |
|----------------------------|

Sales table

----------------------------------------
| id  | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1   | 1          | 1        |7.05    |
| 2   | 1          | 2        |9.00    |
| 3   | 2          | 3        |1.00    |
| 4   | 2          | 3        |1.00    |
| 5   | 2          | 5        |1.00    |
|  and many more ....                  |
| 1000| 20         | 4        |1.00    |
|--------------------------------------|

The relationships are:

Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales

What I want to achieve

I want to use cubes to be able to do a display by pagination in the following manner:

Given the stores S1-S3:

-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9   | 0  |
|Plate 15 |0   |0   | 2  |
|  and many more ....    |
|------------------------|

Note the following:

Even though there were no records in sales for Saucer 12 under Store S3, I displayed 0 instead of null or none.
I want to be able to do sort by store, say descending order for, S3.
The cells indicate the SUM total of that particular product spent in that particular store.
I also want to have pagination.

What I tried

This is the configuration I used:

"cubes": [
    {
        "name": "sales",
        "dimensions": ["product", "store"],
        "joins": [
            {"master":"product_id", "detail":"product.id"},
            {"master":"store_id", "detail":"store.id"}
        ]
    }
],
"dimensions": [
    { "name": "product", "attributes": ["code", "name"] },
    { "name": "store", "attributes": ["code", "address"] }
]

This is the code I used:

 result = browser.aggregate(drilldown=['Store','Product'],
                               order=[("Product.name","asc"), ("Store.name","desc"), ("total_products_sale", "desc")])

I didn't get what I want.
I got it like this:

----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1          | 1        |       7.05          |
| 1          | 2        |       9             |
| 2          | 3        |       2.00          |
|  and many more ....                         |
|---------------------------------------------|

which is the whole table with no pagination and if the products not sold in that store it won't show up as zero.

My question

How do I get what I want?

Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?

Update

I have read more. I realised that what I want is called dicing as I needed to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations

Cross-posted at Cubes GitHub issues to get more attention.

BTW, I assume these are backwards: `3. Sales has many Store 4. Sales has many Product`. Should be: `3. Store has many Sales 4. Product has many Sales` — Erwin Brandstetter, Jul 24 '16 at 04:02

score 1 · Answer 1 · edited May 23 '17 at 12:02

This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:

PostgreSQL Crosstab Query

And this about the "extra" column in the crosstab() output:

Pivot on Multiple Columns using Tablefunc

SELECT product_id, product
     , COALESCE(s1, 0) AS s1               --  1. ... displayed 0 instead of null
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM   crosstab(
     'SELECT s.product_id, p.name, s.store_id, s.sum_amount
      FROM   product p
      JOIN  (
         SELECT product_id, store_id
              , sum(amount) AS sum_amount  -- 3. SUM total of product spent in store
         FROM   sales
         GROUP  BY product_id, store_id
         ) s ON p.id = s.product_id
      ORDER  BY s.product_id, s.store_id;'
   , 'VALUES (1),(2),(3),(4),(5)'          -- desired store_id's
   ) AS ct (product_id int, product text   -- "extra" column
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER  BY s3 DESC;                         -- 2. ... descending order for S3

Produces your desired result exactly (plus product_id).

To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.

SQL Fiddle with base query.
_{The tablefunc module is not installed on sqlfiddle.}

Major points

Read the basic explanation in the reference answer for crosstab().
I am including with product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
- PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
- Dynamic alternative to pivot with CASE and GROUP BY

I also want to have pagination.

That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:

Displaying data in grid view page by page

I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.

To optimize performance for big result sets, consider:

If I used postgres queries, I would need to write the sane query every time I need something like this. I chose cubes so that it's more easily reusable. I prefer an answer using the cubes library — Kim Stacks, Jul 24 '16 at 03:31
@KimStacks: I am not familiar with the cubes library. But I am very familiar with this kind of problem, and this is the most efficient solution by far. But yes, the SQL query needs to be adapted if shops in the result change. — Erwin Brandstetter, Jul 24 '16 at 03:39
Awesome detailed answer though. I may have to write my own function to build the query using yours as template — Kim Stacks, Jul 24 '16 at 03:39
@KimStacks: I added another link that might help with dynamic crosstab. — Erwin Brandstetter, Jul 24 '16 at 03:42
Suppose I use your query in my django app, that means I need to execute the queries raw. Can I make it work with a django model class? Especially when the fields are going to be dynamic. https://docs.djangoproject.com/en/1.9/topics/db/sql/ — Kim Stacks, Jul 24 '16 at 04:19
You could wrap it into a plpgsql function to write a table or be used in a materialized view. Or make it a table function and simply call with `SELECT * FROM my_function()`. [Several examples in this linked answer.](http://stackoverflow.com/a/15514334/939860) Not sure about the django model class. I am an expert with Postgres, with django not so much. — Erwin Brandstetter, Jul 24 '16 at 04:29
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118120/discussion-between-kim-stacks-and-erwin-brandstetter). — Kim Stacks, Jul 24 '16 at 04:46