Partitioning Tables: Where to Begin?

Question

I have a table with 100 million rows and this is only going to continue to grow.

In order to speed up select queries, I want to partition this table, what is the best approach to doing this? I have never partitioned a table before and I'm not sure where to even begin.

I do have indexes on the columns most often used for where clauses, group by, or order by.

Edit Part 1: Right now, my select queries take approximately 55-60 seconds each, with indexes.

Edit Part 2: My table is not normalized up to 3NF. It's currently one giant table in a 1NF form. The table has approximately 13 columns with 100 million + rows.

I only have indexes on columns that I use to filter my select queries on.

The cardinality of the columns that are indexed is pretty high with many different values (for example, email, employee number, dates, etc.)

Edit Part 3: I looked at the link that Mark posted and it is a very good reference, I just have a few questions about the function that they used.

CREATE OR REPLACE FUNCTION measurement_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.logdate >= DATE '2006-02-01' AND
         NEW.logdate < DATE '2006-03-01' ) THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF ( NEW.logdate >= DATE '2006-03-01' AND
            NEW.logdate < DATE '2006-04-01' ) THEN
        INSERT INTO measurement_y2006m03 VALUES (NEW.*);
    ...
    ELSIF ( NEW.logdate >= DATE '2008-01-01' AND
            NEW.logdate < DATE '2008-02-01' ) THEN
        INSERT INTO measurement_y2008m01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'Date out of range.  Fix the measurement_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$
LANGUAGE plpgsql;

What exactly does $$ mean? I have never created a function before using SQL and am new to programming in general.

When you are using the NEW.* part, NEW refers to the new values being entered in? It's not a particular table or anything, it's just new rows being entered. You use "NEW" to represent the new rows, correct? I have been experimenting with this and that's the conclusion that I came to, but I just wanted to see what others thought.

Edit Part 4: $$ is apparently used to indicate when you are starting and ending a function. It's also used to write string constants. I learned this from What are '$$' used for in PL/pgSQL

Is your table normalised upto 3NF ? Why do you have seconday indexes on non-key columns ? What is the cardinality of these columns ? — wildplasser, May 15 '16 at 16:37
Reading [this](http://www.postgresql.org/docs/current/static/ddl-partitioning.html) from the PostgreSQL documentation would be a good first step. I think it will cover more than any Stack Overflow answer could. — Mark Hildreth, May 15 '16 at 18:10
`It's currently one giant table in a 1NF form.` That is not what I would call a table. it is a spreadsheet with indexes. — wildplasser, May 15 '16 at 22:04
I thought 3NF is best when doing inserts/updates/deletes but slow when doing select queries? I thought 1NF form is best for just selecting data? That's why I put it in 1NF form. — Chicken Sandwich No Pickles, May 15 '16 at 22:12
@MarkHildreth I have edited my post, that thread was very helpful. I have added my new questions (edited part 3) to my post. Those are my only questions remaining and then we can mark this thread solved =) — Chicken Sandwich No Pickles, May 15 '16 at 23:59

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

There are many ways you can partition a table, but you may use other strategies rather than partitioning, such as partial indexes. It will be very difficult to provide you with any advice unless you include your table's schema, a sample of the data in it, a sample query you normally run, and with it the results of EXPLAIN ANALYZE for that query, to get an idea of what PostgreSQL is doing and where is the impact.

Part 1

55-60 seconds per query is generally considered unacceptable. You have some serious fixing to do there.

Part 2

Normalisation is not necessarily a requirement nor a way to address performance issues. It really depends on how you source and access your data. 13 columns is not much at all, nor they are 100 million rows.

Having indexes on columns you don't filter by would be a total waste, so you're good there. However, having an index on a column does not guarantee it will be used: it depends on the planner's strategy for a given query, the cardinality of the data, and the kind of filters and/or join conditions in your query. Again, having an example of query you run along with an explain analyze of that query would be essential for providing any kind of useful advice.

Part 3

$$ is an alternative form of quoting. It is equivalent to single quote ', but can be nested with single quote. Functions are created as a string, and creating one with quotes would require any quotes in the function to be escaped (doubled), like this:

CREATE OR REPLACE FUNCTION measurement_insert_trigger() RETURNS TRIGGER AS '
BEGIN
    IF ( NEW.logdate >= DATE ''2006-02-01'' AND
         NEW.logdate < DATE ''2006-03-01'' ) THEN
...
END;
' LANGUAGE plpgsql;

Also, editors with syntax highlight like PgAdmin3 would paint the whole string as red, rather than continuing to highlight normal SQL syntax. This is what $$ is used for. Also, dollar-quotes can be named and nested, as follows:

CREATE OR REPLACE FUNCTION tablecount(tablename text) RETURNS bigint AS $func$
DECLARE
  _cnt bigint;
BEGIN
  IF EXISTS (SELECT 1 FROM pg_class WHERE relname = $1) THEN
    EXECUTE $exec$ SELECT COUNT(*) FROM $exec$||quote_ident($1) into _cnt;
  END IF;
  RETURN _cnt;
END;
$func$ LANGUAGE plpgsql;

NEW and OLD are RECORDS, which in triggers are set with old and new values, depending on the type of trigger and whether the trigger is executed before or after changes. Fields in records can be accessed in the same way they can be accessed by the alias of a table in a query, e.g. NEW.id.

You can find all of this well explained in http://www.postgresql.org.

Partitioning Tables: Where to Begin?

1 Answers1

Part 1

Part 2

Part 3