2

I need to insert 46mln points into PostGIS database in a decent time. Inserting 14mln points was executing around 40 minutes, it its awful and inefficient.

I created database with spatial GIST index and wrote this code:

import psycopg2
import time

start = time.time()

conn = psycopg2.connect(host='localhost', port='5432', dbname='test2', user='postgres', password='alfabet1')

filepath = "C:\\Users\\nmt1m.csv"
curs = conn.cursor()
with open(filepath, 'r') as text:
    for i in text:
        i = i.replace("\n", "")
        i = i.split(sep=" ")
        curs.execute(f"INSERT INTO nmt_1 (geom, Z) VALUES (ST_GeomFromText('POINTZ({i[0]} {i[1]} {i[2]})',0), {i[2]});")
conn.commit()
end = time.time()

print(end - start)

curs.close()
conn.close()

Im looking for the best way to inserting data, it not must be in python.

Thanks ;)

stolarinho
  • 65
  • 5
  • 3
    Have a look at https://stackoverflow.com/a/57337636/7216865 or https://www.psycopg.org/docs/cursor.html#cursor.copy_from – Maurice Meyer Feb 05 '21 at 21:28

2 Answers2

2

Cześć! Welcome to SO.

There are a few things you can do to speed up your bulk insert:

  1. If the target table is empty or is not being used in a production system, consider dropping the indexes right before inserting the data. After the insert is complete you can recreate them. This will avoid PostgreSQL to re-index your table after every insert, which in your case means 46 million times.

  2. If the target table can be entirely built from your CSV file, consider creating an UNLOGGED TABLE. Unlogged tables are much faster than "normal" tables, since they (as the name suggests) are not logged in the WAL file (write-ahead log). Unlogged tables might be lost in case of database crash or an unclean shutdown!

  3. Use either the PostgreSQL COPY command or copy_from as @MauriceMeyer pointed out. If for some reason you must stick to inserts, make sure you're not committing after every insert ;-)

Cheers

Jim Jones
  • 18,404
  • 3
  • 35
  • 44
2

Thanks Jim for help, according to your instructions better way to insert data is:

import psycopg2
import time

start = time.time()
conn = psycopg2.connect(host='localhost', port='5432', dbname='test2',
user='postgres', password='alfabet1')
curs = conn.cursor()
filepath = "C:\\Users\\Jakub\\PycharmProjects\\test2\\testownik9_NMT\\nmt1m.csv"

curs.execute("CREATE UNLOGGED TABLE nmt_10 (id_1 FLOAT, id_2 FLOAT, id_3 FLOAT);")

with open(filepath, 'r') as text:
     curs.copy_from(text, 'nmt_10', sep=" ")

curs.execute("SELECT AddGeometryColumn('nmt_10', 'geom', 2180, 'POINTZ', 3);")
curs.execute("CREATE INDEX nmt_10_index ON nmt_10 USING GIST (geom);")
curs.execute("UPDATE nmt_10 SET geom = ST_SetSRID(ST_MakePoint(id_1, id_2, id_3), 2180);")

conn.commit()
end = time.time()
print(end - start)

cheers

stolarinho
  • 65
  • 5