
I'm working on the design for a RoR project for my company, and our development team has already run into a bit of a debate about the design, specifically the database.

We have a model called Message that needs to be persisted. It's a very, very small model with only three db columns other than the id; however, there will likely be a lot of these records when we go to production. We're looking at as many as 1,000,000 insertions per day. The records will only ever be searched by two foreign keys, which can be indexed. They never have to be deleted, but we also don't need to keep them once they're about three months old.

So, what we're wondering is: will implementing this table in Postgres present a significant performance problem? Does anyone have experience with very large SQL databases who can tell us whether or not this will be an issue? And if so, what alternative should we go with?
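To make the shape concrete, here's roughly what the table would look like (a sketch only; the column names are placeholders, and one plausible reading of "three columns" is two foreign keys plus a timestamp for the three-month expiry):

```sql
-- Rough sketch only; column names are invented for illustration.
create table messages (
    id           bigserial primary key,
    sender_id    bigint      not null,  -- foreign key #1, to be indexed
    recipient_id bigint      not null,  -- foreign key #2, to be indexed
    created_at   timestamptz not null default now()  -- lets us drop rows older than ~3 months
);

create index on messages (sender_id);
create index on messages (recipient_id);
```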

Milen A. Radev
Dylan Karr
  • With a good caching layer and a little configuration in PG you should be fine. Tackle performance issues case by case and avoid premature optimization. That said, partitioning and replication are always great options you can take advantage of once you hit bottlenecks. – Sam Feb 18 '14 at 21:56
  • Related questions [here](http://stackoverflow.com/questions/13639626/database-columns-in-select-or-create-statements/13639920#13639920) and [here](http://stackoverflow.com/questions/12606842/what-is-the-maximum-number-of-columns-in-a-postgresql-select-query). – Erwin Brandstetter Feb 19 '14 at 00:37
  • We process about 30 million messages per day in one 5+ TB PostgreSQL database; it works fine. – Frank Heikens Feb 19 '14 at 09:01
  • See also http://stackoverflow.com/questions/3132444/maximum-usable-number-of-rows-in-a-postgresql-table – rogerdpack Feb 27 '15 at 13:30
  • FYI, I happened to be reading https://www.postgresql.org/about/ today and noticed that it says that (in principle) the number of rows in a table is unlimited. – Al Chou Feb 05 '17 at 20:10
  • @FrankHeikens - I know this is an old comment, but I'm interested in knowing details about the use case and whether this setup is still working for you - could you elaborate, please? – favq Feb 02 '19 at 14:53
  • @FrankHeikens can you tell me about the hardware spec you are using? – Arya Jul 16 '19 at 07:23

2 Answers


Rows per table won't be an issue on their own.

Roughly speaking, 1 million rows a day for 90 days is 90 million rows. Without knowing all the details of what you are doing, I see no reason Postgres can't deal with that.

Depending on your data distribution, you can use a mixture of indexes, partial (filtered) indexes, and some kind of table partitioning to speed things up once you see what performance issues you may or may not have. Your problem will be the same on any other RDBMS that I know of. If you only need 3 months' worth of data, design a process to prune off the data you no longer need; that way you will have a consistent volume of data in the table.

You're lucky that you know how much data will exist: test at that volume and see what you get. Generating 90 million rows may be as easy as:

select x, 1 as c2, 2 as c3
from generate_series(1, 90000000) x;
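Building on that, here is a sketch of a fuller load test: materialize a throwaway table shaped like the one in the question (the table and column names here are invented for illustration), index it the same way, and time the lookups you actually plan to run:

```sql
-- Hypothetical test table; sender_id/recipient_id stand in for the
-- two foreign keys mentioned in the question.
create table messages_test as
select x as id,
       (x % 50000) as sender_id,      -- fake foreign key #1
       (x % 80000) as recipient_id,   -- fake foreign key #2
       now() - (x % 90) * interval '1 day' as created_at
from generate_series(1, 90000000) x;

create index on messages_test (sender_id);
create index on messages_test (recipient_id);
analyze messages_test;

-- Time a typical lookup:
explain analyze select * from messages_test where sender_id = 1234;

-- And sketch the three-month pruning job:
delete from messages_test where created_at < now() - interval '3 months';
```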

https://wiki.postgresql.org/wiki/FAQ

Limit   Value
Maximum Database Size       Unlimited
Maximum Table Size          32 TB
Maximum Row Size            1.6 TB
Maximum Field Size          1 GB
Maximum Rows per Table      Unlimited
Maximum Columns per Table   250 - 1600 depending on column types
Maximum Indexes per Table   Unlimited
Jarekczek
Kuberchaun
  • I agree that 90 million rows won't be a problem for PostgreSQL. But it *might* be a problem for an ORM with PostgreSQL. (An ORM with any dbms, actually.) – Mike Sherrill 'Cat Recall' Feb 19 '14 at 13:16
  • @MikeSherrill'Catcall' Good point, I was just focused on "How big is too big for a PostgreSQL table?" – Kuberchaun Feb 19 '14 at 14:23
  • @MikeSherrill'CatRecall' why might it be a problem for an ORM? :) – yeyo Jul 25 '15 at 13:30
  • @yeyo: Because ORMs usually use a lot of queries to get data that *could* be returned with only one or two. The OP is using Ruby on Rails. – Mike Sherrill 'Cat Recall' Jul 25 '15 at 14:58
  • This is a little late, but I think that in a lot of cases (especially with rails / active record) it is common to completely remove the ORM from the equation and write a raw sql string to query for performance reasons. Don't let your ORM make data decisions for you! It's an accessory, not an essential. – Stefan Theard May 24 '17 at 15:03
  • The URL quoted in the answer does not show these limits currently - anyone know where it's moved to? – Shorn Jul 13 '18 at 21:36
  • It is not correct to say that Rails/ActiveRecord uses a lot of queries to get data that could be returned with only one or two. It is totally capable of doing the same thing in one or two queries, but you have to actually tell it to do so rather than relying on default behavior. The problem is that because it's so easy to use and so abstracted, a lot of queries are written by people with less experience who aren't necessarily aware of such things, let alone actively taking them into consideration. That's a general problem with making programming tools easy, not an ORM thing or a Rails thing. – TiggerToo Aug 06 '21 at 11:55
  • I believe the real question should be how much is too much for Postgres with given hardware specs. With unlimited resources your only barrier would be the hard limits of your build, which, again, with unlimited resources could be rebuilt for the specific hardware architecture and expected limits. But I can't see how a billion 1 KB rows would hit any kind of limit in any kind of RDBMS that I know of. – KodornaRocks Oct 26 '21 at 22:57
  • I think the most important limit is max 32 TB bytes per table. All the other limits are so big that you cannot hit them with any sensible data structure. If you store 4 KB per human for the whole Earth, you're going to hit the 32 TB limit. – Mikko Rantalainen Mar 05 '23 at 08:53

Another way to speed up your queries significantly on a table with > 100 million rows is to cluster the table on the index that is most often used in your queries. Do this in your database's "off" hours. We have a table with > 218 million rows and have found 30X improvements.

Also, for a very large table, it's a good idea to create an index on your foreign keys.

EXAMPLE:

  1. Assume we have a table named investment in a database named ccbank.
  2. Assume the index most used in our queries is (bankid, record_date).

Here are the steps to create and cluster an index:

  1. psql -c "drop index investment_bankid_rec_dt_idx;" ccbank
  2. psql -c "create index investment_bankid_rec_dt_idx on investment(bankid, record_date);"
  3. psql -c "cluster investment_bankid_rec_dt_idx on investment;"
  4. vacuumdb -d ccbank -z -v -t investment

In steps 1-2 we replace the old index with a new, optimized one. In step 3 we cluster the table: this basically puts the table in the physical order of the index, so that when PostgreSQL performs a query it is likely to find the next rows it needs already cached. In step 4 we vacuum and analyze the table to refresh the statistics for the query planner.
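For reference, the `cluster index_name on table_name` spelling in step 3 is the old, deprecated form and has been removed in recent PostgreSQL releases; on a current server the same procedure (same table and index names as in the example above) would look like this:

```sql
-- Same steps on a current PostgreSQL server, where the old
-- "CLUSTER index ON table" spelling is no longer accepted.
drop index if exists investment_bankid_rec_dt_idx;
create index investment_bankid_rec_dt_idx on investment (bankid, record_date);
cluster investment using investment_bankid_rec_dt_idx;
analyze investment;  -- refresh planner statistics (what vacuumdb -z does)
```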

mfluehr
James Doherty
  • > in the off hours cluster the table on the index that is most often used in your queries... can you explain how this is done? – spy Nov 17 '19 at 02:18
  • Can this be used to avoid the 32 TB per relation limit? – Mikko Rantalainen Mar 05 '23 at 08:55