I have following star schema which has bunch of tables involved as shown below to determine availability of book in a library
f_book_availability (This table consist of availability of books with references to other dimension table which i will explain shortly) Index primary key
+-------------------------------------------------------------------------------+
| id | book_id |publisherid | location_id | genre_id | date_id| available |
| | | | | | | |
+-------------------------------------------------------------------------------+
| 1 | 1 | 1 | 72 | 1 | 1 | 1 |
| | | | | | | |
+-------------------------------------------------------------------------------+
| 2 | 2 | 1 | 60 | 2 | 1 | 1 |
| | | | | | | |
+-------------------------------------------------------------------------------+
d_book - This dimension table has details on book like name and type. Type only has id 1 and 2. 1 means "published by publisher" and 2 means "self-published" and does not have any reference table. Index primary key
+-----------+-----------+------------+
| id | type | name |
+------------------------------------+
| 1 | 1 | LOR |
+------------------------------------+
| 2 | 2 | My life |
+-----------+-----------+------------+
d_publisher : This dimension table has publisher information. Index primary key
+-----------+------------
| id | name |
+-----------------------+
| 1 | abc |
+-----------------------+
| 2 | def |
+-----------+------------
d_location- This dimension is a tricky one. (Note schema is already there and I cannot modify it). It has id of place and id of parent place. Note for id we save the hierarchy for example if you pick id 72 which is a leaf node and tells the rack of library, you can see for each leaf node there are four entries with different parents from which you can know the hierarchy country->City->Library->Rack. And this holds true for each place. For example if you have place for Library then you can find the hierarchy country->City->Library. Indexes 1) Primary key (id & parentId), 2) parentId
+-----------+-----------+------------+------------+
| id | parent_id | name | c_code |
| | | | |
+-------------------------------------------------+
| 1 | 1 | France | FR |
+-------------------------------------------------+
| 4 | 1 | Paris | FR |
+-------------------------------------------------+
| 4 | 4 | Paris | FR |
+-------------------------------------------------+
| 25 | 1 | GtLibrary | FR |
+-------------------------------------------------+
| 25 | 4 | GtLibrary | FR |
+-------------------------------------------------+
| 25 | 25 | GtLibrary | FR |
+-------------------------------------------------+
| 72 | 1 | Rack1 | FR |
+-------------------------------------------------+
| 72 | 4 | Rack1 | FR |
+-------------------------------------------------+
| 72 | 25 | Rack1 | FR |
+-------------------------------------------------+
| 72 | 72 | Rack1 | FR |
+-----------+-----------+------------+------------+
d_genre : This dimension table has genre information Index primary key
+-----------+------------
| id | name |
+-----------------------+
| 1 | fantasy |
+-----------------------+
| 2 | horror |
+-----------+------------
d_date : This dimension table has all dates (Note there are other columns which i am not showing like month, day, year, dayofweek, weeknumber but i am not showing it just for simplicity just to let you know its not just date as it looks stupid :) ) Indexes - 1) primary key 2) date
+-----------+--------------
| id | date |
+-------------------------+
| 1 | 2020-11-25|
+-------------------------+
| 2 | 2019-10-24|
+-----------+--------------
From this table i am trying to exact information whether a book is available on a certain date with information like its publisher, genre, date, location and its immediate parent location.
I have written the following query
select
fba.id,
location.c_code as country,
parentLocation.name as parentPlace,
location.id as locationId,
location.name as locationName,
publisher.id as "publisherId",
publisher.name as publisherName,
case when book.type = 1 then 'published' else 'self-published' end as "bookType",
book.type as typeId,
genre.name as genreName,
book.id as "bookId",
book.name as bookTitle,
d."date",
fba.available
from f_book_availability fba
join d_book book on fba.product_id = product.id
join d_publisher publisher on fba.publisherid = publisher.id
join d_location location on fba.location_id = location.id
join d_location parentLocation on location.parent_id = parentLocation.id
join d_genre genre on fba.genre_id = genre.id
join d_date d on fba.date_id = d.id
where
location.id <> location.parent_id
and d."date" >= now() and d."date" <= '2020-12-01'
and location.c_code ='FR'
and book.type = 1
and genre.name = 'fantasy'
Actual Output
+--------+---------+-------------------------+--------------+----------------------------+----------+--------+------------+--------+-----------+-----------+-----------+
| fba_id | country | parentPlace| locationId | locationName | publisherid| publisherName | bookType | typeId | genreName | bookId | bookTitle | date | available |
| | | | | | | | | | | | | | |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | | | | | | | | | | | | |
| 1 | FR | France | 72 | Rack1 | 1 | abc | published| 1 | fantasy | 1 | LOR | 2020-11-25| 1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | | | | | | | | | | | | |
| 1 | FR | Paris | 72 | Rack1 | 1 | abc | published| 1 | fantasy | 1 | LOR | 2020-11-25| 1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | | | | | | | | | | | | |
| 1 | FR | Paris | 72 | Rack1 | 1 | abc | published| 1 | fantasy | 1 | LOR | 2020-11-25| 1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | | | | | | | | | | | | |
| 1 | FR | GtLibrary | 72 | Rack1 | 1 | abc | published| 1 | fantasy | 1 | LOR | 2020-11-25| 1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | | | | | | | | | | | | |
| 1 | FR | GtLibrary | 72 | Rack1 | 1 | abc | published| 1 | fantasy | 1 | LOR | 2020-11-25| 1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | | | | | | | | | | | | |
| 1 | FR | GtLibrary | 72 | Rack1 | 1 | abc | published| 1 | fantasy | 1 | LOR | 2020-11-25| 1 |
+--------+---------+------------+------------+--------------+------------+---------------+-------------------+------------+--------+-----------+-----------------------+
Expected output: As you can see above there are duplicates because of the join with parent location and condition.
- The best thing i want to achieve is just get one 1 row that is the final row for rack and library. It would have been easier if location had an immediateParent flag but I cannot change the schema
- The second best thing would have been to have distinct record so in this case there will be only 3 records rack with library, rack with city, rack with country. This would be ok if i cannot achieve the first option. However distinct clause has lot of cost and i cannot figure out how to reduce the cost
Cost
Unique (cost=12070116.66..12072185.22 rows=63648 width=179)
-> Sort (cost=12070116.66..12070275.78 rows=63648 width=179)
Sort Key: fba.id, parentLocation.name, location.id, location.name, publisher.id, publisher.name, (CASE WHEN (book.type = 1) THEN 'published'::text ELSE 'self-published'::text END), genre.name, book.id, book.name, d.date, fba.available
-> Hash Join (cost=3316.39..12059378.75 rows=63648 width=179)
Hash Cond: (location.parent_id= parentLocation.id)
-> Hash Join (cost=2601.08..12057653.07 rows=19090 width=141)
Hash Cond: (fa.publisher_id = publisher.id)
-> Hash Join (cost=2475.28..12057477.06 rows=19090 width=120)
Hash Cond: (fba.date_id = d.id)
-> Gather (cost=2466.05..12051967.29 rows=2092656 width=124)
Workers Planned: 2
-> Hash Join (cost=1466.05..11841701.69 rows=871940 width=124)
Hash Cond: (fba.location_id = location.id)
-> Hash Join (cost=820.92..11816871.09 rows=1374762 width=99)
Hash Cond: (fa.book_id = book.id)
-> Hash Join (cost=8.30..11808952.45 rows=2706393 width=54)
Hash Cond: (fba.genre_id = genre.id)
-> Parallel Seq Scan on f_book_availability fba (cost=0.00..11047094.57 rows=278758458 width=45)
-> Hash (cost=8.29..8.29 rows=1 width=25)
-> Seq Scan on d_genre genre(cost=0.00..8.29 rows=1 width=25)
Filter: ((tech_en)::text = 'fantasy'::text)
-> Hash (cost=702.26..702.26 rows=8829 width=53)
-> Seq Scan on d_book book (cost=0.00..702.26 rows=8829 width=53)
Filter: (type = 1)
-> Hash (cost=613.88..613.88 rows=2500 width=33)
-> Seq Scan on d_location location (cost=0.00..613.88 rows=2500 width=33)
Filter: ((id <> parent_id) AND ((c_code)::text = 'FR'::text))
-> Hash (cost=8.86..8.86 rows=29 width=8)
-> Index Scan using date_unique on d_date d (cost=0.28..8.86 rows=29 width=8)
Index Cond: ((date >= now()) AND (date <= '2020-12-01'::date))
-> Hash (cost=98.69..98.69 rows=2169 width=29)
-> Seq Scan on d_publisher publisher (cost=0.00..98.69 rows=2169 width=29)
-> Hash (cost=546.25..546.25 rows=13525 width=22)
-> Seq Scan on d_location parentLocation (cost=0.00..546.25 rows=13525 width=22)
Sorry if the post is not meant to be in here and if there is typo as I had to work for two hours to create that ascii table and there may be typos as i changed column names just to give the example