
I have a table that has 10 million plus records (rows) in it. I am trying to do a one-time load into S3 by running a SELECT * against the table and writing the result to a gzip file on my local file system. Currently, my script collects about 800,000 records into the gzip file, but then I receive an error and the remaining records are obviously not written.

There is no simple way to continue where a query left off in SQL (for example, if you run ten LIMIT 800,000 queries, the results won't come back in a consistent order).

So, is there a way to write a Python/Airflow function that can load the 10 million+ row table in batches? Perhaps there's a way in Python where I can run a SELECT * statement and continue it after x number of records, writing each chunk to a separate gzip file? (A rough sketch of what I'm imagining follows my current script below.)

Here is my Python/Airflow script so far; when run, it only writes 800,000 records to the path variable:

import csv
import gzip

# create_tunnel_postgres, conn_postgres_internal and get_etl_record are
# helpers defined elsewhere in the DAG.

def gzip_postgres_table(table_name, **kwargs):
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    server_post = create_tunnel_postgres()
    server_post.start()
    etl_conn = conn_postgres_internal(server_post)

    record = get_etl_record(kwargs['master_table'],
                            kwargs['table_name'])
    cur = etl_conn.cursor()

    unload_sql = '''SELECT *
                        FROM schema1.database1.{0} '''.format(record['table_name'])

    cur.execute(unload_sql)
    result = cur.fetchall()          # pulls all 10 million+ rows into memory at once
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp, delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_post.stop()
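
Something along these lines is what I'm imagining, a rough, untested sketch that assumes psycopg2 and reuses the same helpers; the function name and batch size are just placeholders:

def gzip_postgres_table_batched(table_name, batch_size=1000000, **kwargs):
    # Rough sketch only -- a psycopg2 named (server-side) cursor keeps the
    # result set on the server, so fetchmany() pulls it down in chunks
    # instead of loading 10 million+ rows into memory at once.
    server_post = create_tunnel_postgres()
    server_post.start()
    etl_conn = conn_postgres_internal(server_post)

    cur = etl_conn.cursor(name='batched_unload')   # named => server-side cursor
    cur.execute('SELECT * FROM schema1.database1.{0}'.format(table_name))

    part = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        path = '/usr/local/airflow/{}_{}.gz'.format(table_name, part)
        with gzip.open(path, 'wt') as fp:
            writer = csv.writer(fp, delimiter=',')
            writer.writerow([col[0] for col in cur.description])  # header in each file
            writer.writerows(rows)
        part += 1

    cur.close()
    etl_conn.close()
    server_post.stop()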
Coder123

1 Answer


The best, I mean THE BEST, approach to insert so many records into PostgreSQL, or to get them out of PostgreSQL, is to use PostgreSQL's COPY. This means you would have to change your approach drastically, but there's no better way that I know of in PostgreSQL. COPY manual

COPY either writes the result of the query you are executing to a file, or loads a table from a file.

COPY moves data between PostgreSQL tables and standard file-system files.

The reason why it is the best solution is that you are using PostgreSQL's built-in method for handling external data, without intermediaries, so it's fast and secure.

COPY works like a charm with CSV files. You should switch to a file-based approach built around COPY.

Since COPY runs plain SQL, you can divide your data using LIMIT and OFFSET in the query (add an ORDER BY on a stable key so every slice comes back in a consistent order). For example:

COPY (SELECT * FROM country LIMIT 10 OFFSET 10) TO '/usr1/proj/bray/sql/a_list_of_10_countries.copy';
-- This exports 10 countries, skipping the first 10 rows

COPY TO a file only works with paths that the PostgreSQL server user can write to on the database server itself.
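
If the file has to end up on the Airflow machine (as in the question) rather than on the database server, the same COPY can be streamed through the client connection with COPY ... TO STDOUT. A minimal sketch, assuming psycopg2 (its copy_expert() call) and reusing the helper functions and paths from the question; the wrapper name is just a placeholder:

def gzip_table_via_copy(table_name, **kwargs):
    # Sketch: stream COPY output through the existing connection into a local
    # gzip file, so nothing needs to be written on the database server.
    # Assumes gzip is imported and the question's helper functions exist.
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    server_post = create_tunnel_postgres()
    server_post.start()
    etl_conn = conn_postgres_internal(server_post)

    copy_sql = ("COPY (SELECT * FROM schema1.database1.{0}) "
                "TO STDOUT WITH CSV HEADER".format(table_name))
    with gzip.open(path, 'wb') as fp:
        cur = etl_conn.cursor()
        cur.copy_expert(copy_sql, fp)   # psycopg2 writes the rows straight into the gzip file
        cur.close()

    etl_conn.close()
    server_post.stop()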

PL Function (edited):

If you want COPY to be dynamic, you can wrap the COPY in a PL/pgSQL function. For example:

CREATE OR REPLACE FUNCTION copy_table(
    table_name text,
    file_name text,
    vlimit text,
    voffset text
) RETURNS VOID AS $$
DECLARE
    query text;
BEGIN
    -- NOTE that file_name has to include its directory too.
    query := 'COPY (SELECT * FROM '||table_name||' LIMIT '||vlimit||' OFFSET '||voffset||') TO '''||file_name||''' DELIMITER '','' CSV';
    EXECUTE query;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

To execute the function you just have to do:

SELECT copy_table('test','/usr/sql/test.csv','10','10')

Notes:

  • If the PL will be public, you have to check for SQL injection attacks.
  • You can program the PL to suit your needs, this is just an example.
  • The function returns VOID, so it just does the COPY; if you need some feedback you should return something else.
  • The function has to be owned by the postgres superuser (or another role allowed to write files on the server), because COPY to a server file needs that privilege; that is why it uses SECURITY DEFINER, so that any database user can run the PL.
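
To produce one file per batch (as asked in the comments below), the PL function can be called in a loop from Python/Airflow with an increasing offset. A minimal sketch, assuming psycopg2 and the copy_table() function above; the wrapper name, paths and sizes are only placeholders:

def copy_table_in_batches(etl_conn, table_name,
                          batch_size=1000000, total_rows=10000000):
    # Sketch: call the copy_table() PL function once per slice so that each
    # batch of rows lands in its own server-side CSV file.
    # etl_conn is an open psycopg2 connection; paths and sizes are illustrative.
    cur = etl_conn.cursor()
    offset, part = 0, 0
    while offset < total_rows:
        file_name = '/usr/sql/{0}_part{1}.csv'.format(table_name, part)
        cur.execute("SELECT copy_table(%s, %s, %s, %s)",
                    (table_name, file_name, str(batch_size), str(offset)))
        offset += batch_size
        part += 1
    cur.close()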
Dan
  • Do you know how to copy the table to a CSV file in 1 million increments, to a different file each time? – Coder123 Feb 13 '20 at 17:41
  • For that you can create a PL function. There you can create the file with a variable. – Dan Feb 13 '20 at 18:05