0

I have following star schema which has bunch of tables involved as shown below to determine availability of book in a library

f_book_availability (This table consist of availability of books with references to other dimension table which i will explain shortly) Index primary key



   +-------------------------------------------------------------------------------+
   |   id   |  book_id |publisherid | location_id | genre_id | date_id| available  |
   |        |          |            |             |          |        |            |
   +-------------------------------------------------------------------------------+
   |   1    |  1       |    1       |      72     |   1      |   1    |    1       |
   |        |          |            |             |          |        |            |
   +-------------------------------------------------------------------------------+
   |   2    |  2       |    1       |      60     |  2       |   1    |     1      |
   |        |          |            |             |          |        |            |
   +-------------------------------------------------------------------------------+

d_book - This dimension table has details on book like name and type. Type only has id 1 and 2. 1 means "published by publisher" and 2 means "self-published" and does not have any reference table. Index primary key

+-----------+-----------+------------+
| id        | type      |  name      |
+------------------------------------+
| 1         |  1        |  LOR       |
+------------------------------------+
| 2         |  2        |  My life   |
+-----------+-----------+------------+

d_publisher : This dimension table has publisher information. Index primary key

+-----------+------------
|  id       |   name    |
+-----------------------+
|  1        |   abc     |
+-----------------------+
|  2        |   def     |
+-----------+------------

d_location- This dimension is a tricky one. (Note schema is already there and I cannot modify it). It has id of place and id of parent place. Note for id we save the hierarchy for example if you pick id 72 which is a leaf node and tells the rack of library, you can see for each leaf node there are four entries with different parents from which you can know the hierarchy country->City->Library->Rack. And this holds true for each place. For example if you have place for Library then you can find the hierarchy country->City->Library. Indexes 1) Primary key (id & parentId), 2) parentId

+-----------+-----------+------------+------------+
| id        | parent_id |  name      | c_code     |
|           |           |            |            |
+-------------------------------------------------+
| 1         |  1        |  France    |  FR        |
+-------------------------------------------------+
| 4         |  1        |  Paris     |  FR        |
+-------------------------------------------------+
| 4         |  4        |  Paris     |  FR        |
+-------------------------------------------------+
| 25        |  1        |  GtLibrary |  FR        |
+-------------------------------------------------+
| 25        |  4        |  GtLibrary |  FR        |
+-------------------------------------------------+
| 25        |  25       |  GtLibrary |  FR        |
+-------------------------------------------------+
| 72        |  1        |  Rack1     |  FR        |
+-------------------------------------------------+
| 72        |  4        |  Rack1     |  FR        |
+-------------------------------------------------+
| 72        |  25       |  Rack1     |  FR        |
+-------------------------------------------------+
| 72        |  72       |  Rack1     |  FR        |
+-----------+-----------+------------+------------+

d_genre : This dimension table has genre information Index primary key

+-----------+------------
|  id       |   name    |
+-----------------------+
|  1        |   fantasy |
+-----------------------+
|  2        |   horror  |
+-----------+------------

d_date : This dimension table has all dates (Note there are other columns which i am not showing like month, day, year, dayofweek, weeknumber but i am not showing it just for simplicity just to let you know its not just date as it looks stupid :) ) Indexes - 1) primary key 2) date

+-----------+--------------
|  id       |   date      |
+-------------------------+
|  1        |   2020-11-25|
+-------------------------+
|  2        |   2019-10-24|
+-----------+--------------

From this table i am trying to exact information whether a book is available on a certain date with information like its publisher, genre, date, location and its immediate parent location.

I have written the following query

select 
    fba.id,
    location.c_code as country,
    parentLocation.name as parentPlace,
    location.id as locationId,
    location.name as locationName,
    publisher.id as "publisherId",
    publisher.name as publisherName,
    case when book.type = 1 then 'published' else 'self-published' end as "bookType",
    book.type as typeId,
    genre.name as genreName,
    book.id as "bookId",
    book.name as bookTitle,
    d."date",
    fba.available 
        from f_book_availability fba 
            join d_book book on fba.product_id = product.id 
            join d_publisher publisher on fba.publisherid = publisher.id
            join d_location location on fba.location_id = location.id 
            join d_location parentLocation on location.parent_id = parentLocation.id
            join d_genre genre on fba.genre_id = genre.id 
            join d_date d on fba.date_id = d.id
            where 
                location.id <> location.parent_id 
                and d."date" >= now() and d."date" <= '2020-12-01'
                and location.c_code ='FR'
                and book.type = 1
                and genre.name = 'fantasy'

Actual Output

+--------+---------+-------------------------+--------------+----------------------------+----------+--------+------------+--------+-----------+-----------+-----------+
| fba_id | country | parentPlace| locationId | locationName | publisherid| publisherName | bookType | typeId | genreName  | bookId | bookTitle | date      | available |
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
| 1      | FR      | France     | 72         | Rack1        | 1          | abc           | published| 1      | fantasy    | 1      | LOR       | 2020-11-25| 1         |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
| 1      | FR      | Paris      | 72         | Rack1        | 1          | abc           | published| 1      | fantasy    | 1      | LOR       | 2020-11-25| 1         |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
| 1      | FR      | Paris      | 72         | Rack1        | 1          | abc           | published| 1      | fantasy    | 1      | LOR       | 2020-11-25| 1         |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
| 1      | FR      | GtLibrary  | 72         | Rack1        | 1          | abc           | published| 1      | fantasy    | 1      | LOR       | 2020-11-25| 1         |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
| 1      | FR      | GtLibrary  | 72         | Rack1        | 1          | abc           | published| 1      | fantasy    | 1      | LOR       | 2020-11-25| 1         |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        |         |            |            |              |            |               |          |        |            |        |           |           |           |
| 1      | FR      | GtLibrary  | 72         | Rack1        | 1          | abc           | published| 1      | fantasy    | 1      | LOR       | 2020-11-25| 1         |
+--------+---------+------------+------------+--------------+------------+---------------+-------------------+------------+--------+-----------+-----------------------+

Expected output: As you can see above there are duplicates because of the join with parent location and condition.

  1. The best thing i want to achieve is just get one 1 row that is the final row for rack and library. It would have been easier if location had an immediateParent flag but I cannot change the schema
  2. The second best thing would have been to have distinct record so in this case there will be only 3 records rack with library, rack with city, rack with country. This would be ok if i cannot achieve the first option. However distinct clause has lot of cost and i cannot figure out how to reduce the cost

Cost

Unique  (cost=12070116.66..12072185.22 rows=63648 width=179)
  ->  Sort  (cost=12070116.66..12070275.78 rows=63648 width=179)
        Sort Key: fba.id, parentLocation.name, location.id, location.name, publisher.id, publisher.name, (CASE WHEN (book.type = 1) THEN 'published'::text ELSE 'self-published'::text END), genre.name, book.id, book.name, d.date, fba.available
        ->  Hash Join  (cost=3316.39..12059378.75 rows=63648 width=179)
              Hash Cond: (location.parent_id= parentLocation.id)
              ->  Hash Join  (cost=2601.08..12057653.07 rows=19090 width=141)
                    Hash Cond: (fa.publisher_id = publisher.id)
                    ->  Hash Join  (cost=2475.28..12057477.06 rows=19090 width=120)
                          Hash Cond: (fba.date_id = d.id)
                          ->  Gather  (cost=2466.05..12051967.29 rows=2092656 width=124)
                                Workers Planned: 2
                                ->  Hash Join  (cost=1466.05..11841701.69 rows=871940 width=124)
                                      Hash Cond: (fba.location_id = location.id)
                                      ->  Hash Join  (cost=820.92..11816871.09 rows=1374762 width=99)
                                            Hash Cond: (fa.book_id = book.id)
                                            ->  Hash Join  (cost=8.30..11808952.45 rows=2706393 width=54)
                                                  Hash Cond: (fba.genre_id = genre.id)
                                                  ->  Parallel Seq Scan on f_book_availability fba  (cost=0.00..11047094.57 rows=278758458 width=45)
                                                  ->  Hash  (cost=8.29..8.29 rows=1 width=25)
                                                        ->  Seq Scan on d_genre genre(cost=0.00..8.29 rows=1 width=25)
                                                              Filter: ((tech_en)::text = 'fantasy'::text)
                                            ->  Hash  (cost=702.26..702.26 rows=8829 width=53)
                                                  ->  Seq Scan on d_book book  (cost=0.00..702.26 rows=8829 width=53)
                                                        Filter: (type = 1)
                                      ->  Hash  (cost=613.88..613.88 rows=2500 width=33)
                                            ->  Seq Scan on d_location location  (cost=0.00..613.88 rows=2500 width=33)
                                                  Filter: ((id <> parent_id) AND ((c_code)::text = 'FR'::text))
                          ->  Hash  (cost=8.86..8.86 rows=29 width=8)
                                ->  Index Scan using date_unique on d_date d  (cost=0.28..8.86 rows=29 width=8)
                                      Index Cond: ((date >= now()) AND (date <= '2020-12-01'::date))
                    ->  Hash  (cost=98.69..98.69 rows=2169 width=29)
                          ->  Seq Scan on d_publisher publisher  (cost=0.00..98.69 rows=2169 width=29)
              ->  Hash  (cost=546.25..546.25 rows=13525 width=22)
                    ->  Seq Scan on d_location parentLocation  (cost=0.00..546.25 rows=13525 width=22)

Sorry if the post is not meant to be in here and if there is typo as I had to work for two hours to create that ascii table and there may be typos as i changed column names just to give the example

Sandeep Nair
  • 436
  • 2
  • 15
  • 1
    Hi - I'm afraid your data design is fundamentally broken and would need to be fixed in order for your queries to work. In dimensional modelling we create unique surrogate keys as the PKs for dimensions and use these as FKs in our fact tables. Unfortunately, in your design (and as you noted) you are using what I assume is a source ID as the identifier in d_location and this is not unique to records in this table - hence your joins and your queries don't work. Any solution other than redesigning your data is just going to be putting a plaster over the issue rather than actually fixing it – NickW Nov 03 '20 at 11:00
  • Upvoting for suggestion. I get your point. Actually d_location has composite key of id and parent_id. This dimension thus helps us tremendously to see the data with hierarchy in other use cases – Sandeep Nair Nov 03 '20 at 11:33
  • 1
    There is nothing fundamentally wrong with the data in d_location (though I would de-normalise all the parent fields into this table) so I'm not suggesting you need to redesign that - it is the key design for the table that is wrong and needs to be corrected, You shouldn't have a composite PK in a Dimension table - as you have discovered by the fact that your queries don't work – NickW Nov 03 '20 at 11:39
  • Makes sense. Thanks for taking time and replying. If you would have answered it, I would happily mark it as Solution – Sandeep Nair Nov 03 '20 at 12:48

1 Answers1

1

Added as an answer so it can be ticked, please :)

There is nothing fundamentally wrong with the data in d_location (though I would de-normalise all the parent fields into this table) so I'm not suggesting you need to redesign that - it is the key design for the table that is wrong and needs to be corrected, You shouldn't have a composite PK in a Dimension table - as you have discovered by the fact that your queries don't work

NickW
  • 8,430
  • 2
  • 6
  • 19