
It may be a silly basic question, but as described in the title, I am wondering how PostgreSQL deals with performance when a table holds millions of entries (with the possibility of reaching a billion).

To put it more concretely, I want to store data (audio, photos and videos) in my database (I only store their paths; the files themselves are organised in the file system), but I have to decide whether to use a single table "data" for all the different types of data, or multiple tables ("data_audio", "data_photos", "data_videos") to separate those types.

The reason I am asking is that roughly 95% of my entries are photos and only 5% are audio and videos, and when I query the database for an audio entry, I don't want the query to be slowed down by all the photo entries (searching for a row among a thousand must be different from searching among a million). So I would like to know how PostgreSQL deals with this and whether there is a way to get the best optimisation.

I have read this topic, which is really interesting and seems relevant: How does database indexing work?

Is that the way I should do it?

Recap of the core information I will store in my main tables:

1st option:

  • DATA TABLE (containing audio, photos and videos):

    • id of type bigserial
    • _timestamp of type timestamp
    • path_file of type text
  • USERS TABLE:

    • id of type serial
    • forename of type varchar(255)
    • surname of type varchar(255)
    • birthdate of type date
    • email_address of type varchar(255)
  • DATA USERS RELATION TABLE:

    • id_data of type bigserial
    • id_user of type serial
  • ACTIVITIES TABLE:

    • id of type serial
    • name of type varchar(255)
    • description of type text
  • DATA ACTIVITIES RELATION TABLE:

    • id_data of type bigserial
    • id_activity of type serial

(SEARCH queries are mainly done on DATA._timestamp and ACTIVITIES.name fields after filtering data by USERS.id)
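To make the 1st option concrete, here is a rough DDL sketch (column names and types taken from the list above; I have written the junction tables with plain integer foreign keys, and the indexes are only my first guess based on the search pattern):

```sql
-- Rough sketch of the 1st option (single "data" table for all media types).
CREATE TABLE users (
    id            serial PRIMARY KEY,
    forename      varchar(255),
    surname       varchar(255),
    birthdate     date,
    email_address varchar(255)
);

CREATE TABLE data (
    id         bigserial PRIMARY KEY,
    _timestamp timestamp,
    path_file  text
);

CREATE TABLE activities (
    id          serial PRIMARY KEY,
    name        varchar(255),
    description text
);

-- Junction tables: plain (big)int foreign keys rather than serials.
CREATE TABLE data_users (
    id_data bigint  REFERENCES data (id),
    id_user integer REFERENCES users (id),
    PRIMARY KEY (id_data, id_user)
);

CREATE TABLE data_activities (
    id_data     bigint  REFERENCES data (id),
    id_activity integer REFERENCES activities (id),
    PRIMARY KEY (id_data, id_activity)
);

-- Indexes matching the search pattern: filter by user, then by timestamp / activity name.
CREATE INDEX ON data (_timestamp);
CREATE INDEX ON activities (name);
CREATE INDEX ON data_users (id_user);
CREATE INDEX ON data_activities (id_activity);
```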

2nd option (only replacing the previous DATA TABLE with the following three tables and keeping all the other tables):

  • DATA_AUDIO TABLE
  • DATA_PHOTOS TABLE
  • DATA_VIDEOS TABLE
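If I keep the single DATA table, I imagine I could also add a column to tell the types apart and create partial indexes for the rare types, so that audio lookups never have to touch the photo rows. This column is not in my list above, it is just a sketch of the idea:

```sql
-- Hypothetical: a type discriminator on the single "data" table.
ALTER TABLE data ADD COLUMN data_type text;  -- e.g. 'audio', 'photo', 'video'

-- Partial indexes: queries for the rare types only look at those rows.
CREATE INDEX ON data (_timestamp) WHERE data_type = 'audio';
CREATE INDEX ON data (_timestamp) WHERE data_type = 'video';
```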

Additional question: Is it a good idea to have one database per user? (In the storyline, being able to query the database for data depends on whether you have permission; if you want to retrieve data from two different users, you have to ask both users for permission. The permission process is a process in its own right and is not handled here, so let's say that when you query the database, the queries are always about a single user.)

I hope I have been clear; thanks in advance for any help or advice!

Cyrille

  • Millions of rows isn't much of a feat really, and billions is perfectly possible when optimising the database properly. Simply do a test by generating a bunch of random data. – deceze Mar 15 '18 at 14:49
  • Note your junction ("*bridge*") tables `data_users` and `data_activities` should *not* contain (big)serials, but (big)ints: FKs referencing the base tables' id columns. – joop Mar 15 '18 at 15:19
  • Thank you very much @deceze and joop, I'm going to try / correct as you said! – Cyrille Dakhlia Mar 15 '18 at 15:35

1 Answer


Answers:

  • PostgreSQL is cool with millions and billions of rows.

  • If the different types of data all have the same attributes and are the same from the database perspective (have the same relationships to other tables etc.), then keep them in one table. If not, use different tables.

  • The speed of index access to a table does not depend much on the size of the table (B-tree lookups scale logarithmically, so even a billion rows only add a few extra index levels).

  • If the data of different users have connections, like they use common base tables or you want to be able to join tables for different users, it is best to keep them in different schemas in one database. If it is important that they be separated no matter what, keep them in different databases.

    It is also an option to keep data for different users in one table, if you use Row Level Security (see the sketch after this list) or let your application take care of it.

    This decision depends strongly on your use case and architecture.

    Warning: don't create clusters with thousands of databases and databases with thousands of schemas. That causes performance problems in the catalogs.
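Here is a minimal sketch of the Row Level Security option mentioned above. The policy condition is only an example: it assumes rows in `data` are linked to their owners through `data_users`, and that each application user connects as a database role whose name equals their email address; how you map roles to users is up to you.

```sql
-- Sketch: only show rows of "data" that belong to the current user.
ALTER TABLE data ENABLE ROW LEVEL SECURITY;

CREATE POLICY data_owner ON data
    FOR ALL
    USING (
        EXISTS (
            SELECT 1
            FROM data_users du
            JOIN users u ON u.id = du.id_user
            WHERE du.id_data = data.id
              AND u.email_address = current_user  -- example mapping of DB role to application user
        )
    );
```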

Laurenz Albe
  • Thank you very much for your complete answer, it is a huge help! I'm going to think about all of it and continue with the light of your answers! – Cyrille Dakhlia Mar 15 '18 at 15:51
  • Hi, I have another question still on the performance matter. Keeping all data in the same table for all users is more manageable, but I was wondering how much it will affect the access time if I have 50 to 100 users trying to access this table. I need my system to be fast and simple (everyone wants this I guess hahaha), but I'd say that it is more important for me to have a fast access time to deliver a better service. How does PostgreSQL deal with this? Should I use multiple databases or servers (if ever I have some)? – Cyrille Dakhlia Mar 20 '18 at 14:08
  • PostgreSQL is great with concurrency. 50 or 100 concurrent readers probably won't slow things down. – Laurenz Albe Mar 20 '18 at 14:32
  • Thank you very much, I will gather my data and keep everything in the same database on the same server for now, and will see in the future if it needs changes or need to scale, thank you again ! – Cyrille Dakhlia Mar 21 '18 at 08:21