
I'm building a Postgres database for a product search (up to 3 million products) with large groups of similar data for each product, e.g. the prices for different countries, and country-specific average ratings, with up to 170 countries.

The natural solution seems to be to use arrays (e.g. a real[] column for the prices and another for the ratings). However, the data needs to be indexed individually for each country for sorting and range queries (the data for different countries is not reliably correlated). So from this discussion I think it would be better to use individual columns for each country.

There are about 8 country-specific properties of which maybe 4 need to be indexed, so I may end up with more than 1300 columns and 650 indexes. Might that be a problem? Is there a better solution?
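
For concreteness, here is a sketch of the two layouts I am comparing (table and column names are illustrative only):

-- Array layout: one row per product, per-country data packed into arrays
CREATE TABLE products_arr (
  product_id int PRIMARY KEY
, prices  real[]   -- prices[i]  = price in country i
, ratings real[]   -- ratings[i] = average rating in country i
);

-- Many-column layout: one separately indexable column per country and property
CREATE TABLE products_cols (
  product_id int PRIMARY KEY
, price_0  real, price_1  real   -- ... up to price_169
, rating_0 real, rating_1 real   -- ... up to rating_169
);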


EDIT, after everyone keeps telling me about many-to-many relationships, normalization and so on:

I am not convinced. If I understand correctly, this always comes down to a junction table (known under many names), as in Erwin Brandstetter's answer.

As I mentioned in my first comment, this would be a great solution if each product had prices and ratings for only a few countries. If that is not the case, however, a junction table can require significantly more memory (consider the ever-repeated product-id and country-id and, more serious still, the row overhead of a narrow table with hundreds of millions of rows).

Here is a Python script to demonstrate this. It creates a junction table product_country for prices and ratings of products in different countries, and a "multi-column table" products for the same. The tables are populated with random values for 100,000 products and 100 countries.

For simplicity I use ints to identify products and countries, and for the junction-table approach I only build the junction table itself.

import psycopg2
from psycopg2.extras import execute_values
from random import random
from time import time

cn = psycopg2.connect(...)  # fill in your connection parameters
cn.autocommit = True
cr = cn.cursor()

num_countries = 100
num_products = 100000


def junction_table():
    print("JUNCTION TABLE")

    cr.execute("CREATE TABLE product_country (product_id int, country_id int, "
               "price real, rating real, PRIMARY KEY (product_id, country_id))")

    t = time()
    for p in range(num_products):
        # use batch-insert, without that it would be about 10 times slower
        execute_values(cr, "INSERT INTO product_country "
                           "(product_id, country_id, price, rating) VALUES %s",
                       [[p, c, random() * 100, random() * 5]
                        for c in range(num_countries)])
    print(f"Insert data took {int(time() - t)}s")

    t = time()
    cr.execute("CREATE INDEX i_price ON product_country (country_id, price)")
    cr.execute("CREATE INDEX i_rating ON product_country (country_id, rating)")
    print(f"Creating indexes took {int(time() - t)}s")

    sizes('product_country')


def many_column_table():
    print("\nMANY-COLUMN TABLE")

    cr.execute("CREATE TABLE products (product_id int PRIMARY KEY, "
               + ', '.join([f'price_{i} real' for i in range(num_countries)]) + ', '
               + ', '.join([f'rating_{i} real' for i in range(num_countries)]) + ')')

    t = time()
    # one row per product already carries all countries, so no batching is needed
    for p in range(num_products):
        cr.execute("INSERT INTO products (product_id, "
                   + ", ".join([f'price_{i}' for i in range(num_countries)]) + ', '
                   + ", ".join([f'rating_{i}' for i in range(num_countries)]) + ') '
                   + "VALUES (" + ",".join(["%s"] * (1 + 2 * num_countries)) + ') ',
                   [p] + [random() * 100 for i in range(num_countries)]
                   + [random() * 5 for i in range(num_countries)])
    print(f"Insert data took {int(time() - t)}s")

    t = time()
    for i in range(num_countries):
        cr.execute(f"CREATE INDEX i_price_{i} ON products (price_{i})")
        cr.execute(f"CREATE INDEX i_rating_{i} ON products (rating_{i})")
    print(f"Creating indexes took {int(time() - t)}s")

    sizes('products')


def sizes(table_name):
    # pg_relation_size: heap only; pg_indexes_size: all indexes of the table
    cr.execute(f"SELECT pg_size_pretty(pg_relation_size('{table_name}'))")
    print("Table size: " + cr.fetchone()[0])
    cr.execute(f"SELECT pg_size_pretty(pg_indexes_size('{table_name}'))")
    print("Indexes size: " + cr.fetchone()[0])


if __name__ == '__main__':
    junction_table()
    many_column_table()

Output:

JUNCTION TABLE
Insert data took 179s
Creating indexes took 28s
Table size: 422 MB
Indexes size: 642 MB

MANY-COLUMN TABLE
Insert data took 138s
Creating indexes took 31s
Table size: 87 MB
Indexes size: 433 MB

Most importantly, the total size (table + indexes) of the junction table is about twice that of the many-column table, and the table size alone is nearly 5 times larger.

This is easily explained by the row-overhead and the repeated product-id and country-id in each row (10,000,000 rows, vs. just 100,000 rows of the many-column table).
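
As a rough sanity check, assuming Postgres's usual per-tuple overhead (a 23-byte tuple header padded to 24 bytes, plus a 4-byte line pointer): 24 + 4 + 16 bytes of data (two ints, two reals) ≈ 44 bytes per row, and 44 bytes × 10,000,000 rows ≈ 420 MB, which agrees well with the measured 422 MB.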

The sizes scale approximately linearly with the number of products (I tested with 700,000 products), so for 3 million products the junction table would be about 32 GB (12.7 GB relation + 19.2 GB indexes), while the many-column table would be just 15.6 GB (2.6 GB table + 13 GB indexes), which is decisive if everything is to be cached in RAM.

Query times are about the same when everything is cached; here is a somewhat typical example for 700,000 products:

EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, price, rating FROM product_country
WHERE country_id=7 and price < 10
ORDER BY rating DESC LIMIT 200

-- Limit  (cost=0.57..1057.93 rows=200 width=12) (actual time=0.037..2.250 rows=200 loops=1)
--   Buffers: shared hit=2087
--   ->  Index Scan Backward using i_rating on product_country  (cost=0.57..394101.22 rows=74544 width=12) (actual time=0.036..2.229 rows=200 loops=1)
--         Index Cond: (country_id = 7)
--         Filter: (price < '10'::double precision)
--         Rows Removed by Filter: 1871
--         Buffers: shared hit=2087
-- Planning Time: 0.111 ms
-- Execution Time: 2.364 ms

EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, price_7, rating_7 FROM products
WHERE price_7 < 10
ORDER BY rating_7 DESC LIMIT 200

-- Limit  (cost=0.42..256.82 rows=200 width=12) (actual time=0.023..2.007 rows=200 loops=1)
--   Buffers: shared hit=1949
--   ->  Index Scan Backward using i_rating_7 on products  (cost=0.42..91950.43 rows=71726 width=12) (actual time=0.022..1.986 rows=200 loops=1)
--         Filter: (price_7 < '10'::double precision)
--         Rows Removed by Filter: 1736
--         Buffers: shared hit=1949
-- Planning Time: 0.672 ms
-- Execution Time: 2.265 ms

Regarding flexibility, data integrity etc., I see no serious problem with the multi-column approach: I can easily add and delete columns for countries, and if a sensible naming scheme is used for the columns it should be easy to avoid mistakes.
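
For example, adding or removing a country would be a plain DDL change (a sketch; country index 170 is hypothetical):

-- adding a country: ADD COLUMN without a default is a cheap, catalog-only change
ALTER TABLE products ADD COLUMN price_170 real;
ALTER TABLE products ADD COLUMN rating_170 real;
CREATE INDEX i_price_170 ON products (price_170);
CREATE INDEX i_rating_170 ON products (rating_170);

-- removing a country: DROP COLUMN also drops the indexes that depend on it
ALTER TABLE products DROP COLUMN price_170;
ALTER TABLE products DROP COLUMN rating_170;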

So I think I have every reason not to use a junction table.

Further, with arrays everything would be clearer and simpler than with many columns, and if there were a way to easily define individual indexes on the array elements, that would be the best solution (maybe the total index size could even be reduced?).
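
As far as I can tell, Postgres does allow expression indexes on individual array elements, so per-country indexes over the array layout are possible in principle. A sketch, using the hypothetical products_arr table from the top of the question (note that a query has to repeat the exact indexed expression, and each such index should be about the same size as the corresponding single-column index):

CREATE INDEX i_arr_price_7  ON products_arr ((prices[7]));
CREATE INDEX i_arr_rating_7 ON products_arr ((ratings[7]));

-- usable for per-country sorting and range queries, because the expressions match:
SELECT product_id, prices[7], ratings[7]
FROM products_arr
WHERE prices[7] < 10
ORDER BY ratings[7] DESC
LIMIT 200;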

So I think my original question is still valid. However, there is of course much more to consider and to test. Also, I'm in no way a database expert, so tell me if I'm wrong.

Here are the test tables from the script for 5 products and 3 countries: product_country, products

jake n
  • No, the natural solution is a proper [many-to-many relationship](https://en.wikipedia.org/wiki/Many-to-many_(data_model)) – Oct 17 '22 at 21:22
  • Does this answer your question? [Is storing a delimited list in a database column really that bad?](https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad) – philipxy Oct 17 '22 at 21:52
  • You seem to be uninformed of some of the most basic aspects of the relational model, information modelling & DB design. PS [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) – philipxy Oct 17 '22 at 21:56
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) – philipxy Oct 19 '22 at 06:05
  • Suggest you follow a few (good) textbooks on the relational model, information modelling & DB design & querying. PS "the natural solution" when one is uninformed tells one nothing. PS A table of rows that could be represented by a key & 2 (roughly speaking) independent lists/arrays would be properly represented by 2 association/join/bridge/pivot/etc tables. On the other hand, if the non-keys aren't (in that way) independent, a 3-way relation(ship)/association represented by a 3-entity relation/table may be good. But what is appropriate depends on theory & heuristics not reflected in this post. – philipxy Oct 19 '22 at 06:22
  • Note that the former can arise by first eliminating one list via an association table, then the next list by another association table. But you don't even give the relevant information; you need to learn the basics of modelling & design. And a column for every value in a range is almost always an antipattern. "Why" is a textbook. Do you really think that your situation is novel? (Rhetorical.) Good luck. – philipxy Oct 19 '22 at 06:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248896/discussion-between-jake-n-and-philipxy). – jake n Oct 19 '22 at 07:14
  • [How can you represent inheritance in a database?](https://stackoverflow.com/q/3579079/3404097) [More](https://stackoverflow.com/q/190296/3404097) [How to design a product table for many kinds of product where each product has many parameters](https://stackoverflow.com/q/695752/3404097) [More](https://stackoverflow.com/a/2945124/3404097) [And more](https://stackoverflow.com/q/5106335/3404097) [Re EAV](https://stackoverflow.com/a/23950836/3404097) etc etc etc [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) – philipxy Oct 28 '22 at 23:08

1 Answer


The "natural" solution for a relational database is to create additional tables in one-to-many or many-to-many relationships. Look into database normalization.

Basic m:n design for product ratings per country:

CREATE TABLE country (
  country_id varchar(2) PRIMARY KEY
, country text UNIQUE NOT NULL
);

CREATE TABLE product (
  product_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, product text NOT NULL
  -- more?
);

CREATE TABLE product_ratings (
  product_id int REFERENCES product
, country_id varchar(2) REFERENCES country
, rating1 real
, rating2 real
  -- more?
, PRIMARY KEY (product_id, country_id)   
);

fiddle
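
For example, a "top products by rating in one country" query maps onto this design like so (a sketch; the supporting index is an assumption, not part of the fiddle):

CREATE INDEX ON product_ratings (country_id, rating1);

-- top 200 products by rating1 in a given country
SELECT product_id, rating1
FROM product_ratings
WHERE country_id = 'us'
ORDER BY rating1 DESC
LIMIT 200;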


Erwin Brandstetter
  • I will think more about your suggestion, but currently I don't see an advantage. _If_ many products had ratings in just some countries, your solution would have an obvious space advantage (but actually the source from which I scrape the data gives me ratings for every country). However, a search for products sorted by the rating in the US, for example, would only be more complicated with your solution, and I can see no advantage in performance. – jake n Oct 17 '22 at 22:45
  • Looking again at your answer, I see I didn't make it clear that there is only one average rating per country. I still don't understand what your table `country` has to do with my problem. – jake n Oct 17 '22 at 23:03
  • @jaken The advantage of a properly [normalized](https://en.wikipedia.org/wiki/Database_normalization) data model is flexibility, scalability and avoiding write anomalies. – Oct 18 '22 at 06:43
  • I updated the question to show why I don't think this is a good solution for my case. – jake n Oct 19 '22 at 05:32