In order to evaluate the load on our platform (Django + PostgreSQL), I would like to literally duplicate the amount of data in the system. It's a bit complicated to create mocks that could emulate the different kinds of objects (since we have a very complex data model). Is there a way to create a duplicate of the database, override primary keys and unique fields with unused values, and merge it with the original?
- Consider natural keys. To create fixtures without a pk, see this: http://stackoverflow.com/questions/9436954/excluding-primary-key-in-django-dumpdata-with-natural-keys – karthikr Nov 13 '13 at 18:05
- Have you considered using tools like [`this`](http://www.sqlmanager.net/en/products/postgresql/datagenerator) or [`this`](http://databene.org/databene-benerator) to generate mock data? – Ihor Romanchenko Nov 13 '13 at 18:24
1 Answer
(I) Explaining the principle
In order to illustrate the principle in a clear way, this explanation assumes the following:
- Every table has a bigserial primary key column called "id"
- There are no unique constraints on tables (except primary keys)
- Foreign key constraints reference only the primary keys of other tables
Apply the following to your database schema:
1. Make sure there are no circular dependencies between tables in your schema. If there are, choose foreign key constraints that would break each such dependency and drop them (you will recreate them later, after you manually handle the affected fields).
2. Sort the tables in topological order and, in that order, execute the script from (3) for every table.
3. For every table <table_schema>.<table_name> from (2), execute:
```sql
/* Create a lookup table containing ordered pairs (id_old, id_new).
   For every existing row in table <table_schema>.<table_name>, a new row
   with id = id_new will be created, with all the other fields copied.
   nextval of sequence <table_schema>.<table_name>_id_seq is fetched to
   reserve an id for each new row. */
CREATE TABLE _l_<table_schema>_<table_name> AS
SELECT id AS id_old,
       nextval('<table_schema>.<table_name>_id_seq') AS id_new
FROM <table_schema>.<table_name>;

/* This part does the actual copying of table data while preserving
   referential integrity. Table <table_schema>.<table_name> has the
   following fields:
     id                     - primary key
     column1, ..., columnN  - fields in the table excluding foreign keys; N >= 0
     fk1, ..., fkM          - foreign keys; M >= 0
   _l_<table_schema_fki>_<table_name_fki> (1 <= i <= M) are the lookup
   tables of the parent tables. We use LEFT JOIN because a foreign key
   field can be nullable in the general case. */
INSERT INTO <table_schema>.<table_name> (id, column1, ..., columnN, fk1, ..., fkM)
SELECT tlookup.id_new,
       t.column1, ..., t.columnN,
       tablefk1.id_new, ..., tablefkM.id_new
FROM <table_schema>.<table_name> t
INNER JOIN _l_<table_schema>_<table_name> tlookup ON t.id = tlookup.id_old
LEFT JOIN _l_<table_schema_fk1>_<table_name_fk1> tablefk1 ON t.fk1 = tablefk1.id_old
...
LEFT JOIN _l_<table_schema_fkM>_<table_name_fkM> tablefkM ON t.fkM = tablefkM.id_old;
```
4. Drop all lookup tables.
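As a concrete illustration, here is how the template expands for a hypothetical two-table schema — `author(id, name)` and `book(id, title, author_id)` — which is not part of the original answer, just an example. Tables are processed in topological order: `author` first, then `book`:

```sql
-- Step for "author" (no foreign keys):
CREATE TABLE _l_public_author AS
SELECT id AS id_old, nextval('public.author_id_seq') AS id_new
FROM public.author;

INSERT INTO public.author (id, name)
SELECT tlookup.id_new, t.name
FROM public.author t
INNER JOIN _l_public_author tlookup ON t.id = tlookup.id_old;

-- Step for "book" (one foreign key referencing author.id):
CREATE TABLE _l_public_book AS
SELECT id AS id_old, nextval('public.book_id_seq') AS id_new
FROM public.book;

INSERT INTO public.book (id, title, author_id)
SELECT tlookup.id_new, t.title, tauthor.id_new
FROM public.book t
INNER JOIN _l_public_book tlookup ON t.id = tlookup.id_old
LEFT JOIN _l_public_author tauthor ON t.author_id = tauthor.id_old;

-- Finally, drop the lookup tables:
DROP TABLE _l_public_book;
DROP TABLE _l_public_author;
```

After this runs, every original row exists twice: once with its old id and once with a freshly reserved id, and each copied `book` row points at the copied `author` row rather than the original.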
(II) Describing my implementation
- To check for circular dependencies, I queried the transitive closure of the foreign-key dependency graph (https://beagle.whoi.edu/redmine/projects/ibt/wiki/Transitive_closure_in_PostgreSQL)
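A lighter-weight alternative to a full transitive closure (a sketch of my own, not the answer's code) is to walk the foreign-key edges in `pg_constraint` with a recursive CTE and flag any path that revisits a table:

```sql
-- Each row of "deps" is one FK edge from a child table to its parent table.
WITH RECURSIVE deps AS (
    SELECT conrelid::regclass AS child, confrelid::regclass AS parent
    FROM pg_constraint
    WHERE contype = 'f'
), walk AS (
    SELECT child, parent, ARRAY[child] AS path,
           child = parent AS is_cycle          -- catches self-referencing FKs
    FROM deps
    UNION ALL
    SELECT w.child, d.parent, w.path || d.parent,
           d.parent = ANY(w.path)              -- revisiting a table means a cycle
    FROM walk w
    JOIN deps d ON d.child = w.parent
    WHERE NOT w.is_cycle
)
SELECT DISTINCT path
FROM walk
WHERE is_cycle;   -- any rows returned indicate a dependency cycle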
- I implemented a topological sort function (ported from a T-SQL version found on a blog). It comes in handy for automation.
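A topological ordering can also be derived directly in SQL; the following is an assumed sketch (not the ported function), assigning each table a level equal to its depth in the FK dependency graph, assuming no cycles remain:

```sql
WITH RECURSIVE fk AS (
    SELECT conrelid::regclass AS child, confrelid::regclass AS parent
    FROM pg_constraint
    WHERE contype = 'f' AND conrelid <> confrelid
), lvl AS (
    -- Level 0: tables in schema "public" that reference no other table.
    SELECT c.oid::regclass AS tbl, 0 AS level
    FROM pg_class c
    WHERE c.relkind = 'r'
      AND c.relnamespace = 'public'::regnamespace
      AND NOT EXISTS (SELECT 1 FROM fk WHERE fk.child = c.oid::regclass)
    UNION ALL
    -- A child table sits one level below each of its parents.
    SELECT fk.child, lvl.level + 1
    FROM fk
    JOIN lvl ON fk.parent = lvl.tbl
)
SELECT tbl, max(level) AS topo_order
FROM lvl
GROUP BY tbl
ORDER BY topo_order;   -- process tables in ascending topo_order
```

Taking `max(level)` per table guarantees a table is copied only after all of its parents, even when it has several.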
- I made a code generator (implemented in PL/pgSQL). It is a function which takes <table_schema> and <table_name> as input parameters and returns the text (SQL) shown in (I.3) for that table. By concatenating the results of the function for every table in topological order, I produced the copy script.
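A generator along these lines might look like the following simplified sketch (my assumption of the approach, not the author's actual function; it only emits the lookup-table part, with columns introspected via `information_schema`):

```sql
CREATE OR REPLACE FUNCTION generate_copy_script(p_schema text, p_table text)
RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    v_cols text;
    v_sql  text;
BEGIN
    -- Collect all columns except the primary key "id"; these would be
    -- spliced into the INSERT ... SELECT part of the template.
    SELECT string_agg(quote_ident(column_name), ', ' ORDER BY ordinal_position)
    INTO v_cols
    FROM information_schema.columns
    WHERE table_schema = p_schema
      AND table_name   = p_table
      AND column_name <> 'id';

    -- Emit the lookup-table creation; the INSERT statement (joining the
    -- parent lookup tables on each FK column) is appended the same way.
    v_sql := format(
        'CREATE TABLE %I AS SELECT id AS id_old, nextval(%L) AS id_new FROM %I.%I;',
        '_l_' || p_schema || '_' || p_table,
        p_schema || '.' || p_table || '_id_seq',
        p_schema, p_table);

    RETURN v_sql;
END;
$$;
```

Using `format()` with `%I`/`%L` keeps identifiers and the sequence name safely quoted in the generated SQL.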
- I made manual changes to the script to satisfy unique constraints and other nuances which the boilerplate script doesn't cover.
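For example (a hypothetical tweak, not from the original script), a unique text column such as `username` can be kept unique by appending the reserved new id during the copy:

```sql
INSERT INTO public.app_user (id, username)
SELECT tlookup.id_new,
       t.username || '_' || tlookup.id_new   -- suffix keeps UNIQUE(username) satisfied
FROM public.app_user t
INNER JOIN _l_public_app_user tlookup ON t.id = tlookup.id_old;
```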
- Done. The script is ready for execution in a single transaction.
When I get the chance, I will "anonymize" my code a little bit, put it on GitHub and post a link here.

Milos