I'm looking for an efficient way to import data from a CSV file into a PostgreSQL table using Python, in batches, because the files are quite large and the server I'm importing the data to is far away. I need an efficient solution, as everything I have tried so far was either slow or simply didn't work. I'm using SQLAlchemy. I wanted to use raw SQL, but it's hard to parameterize, and I would need multiple loops to execute the query for multiple rows.
- Use COPY: https://stackoverflow.com/questions/13125236/sqlalchemy-psycopg2-and-postgresql-copy – Ilja Everilä Jul 12 '20 at 10:36
- You first complain that everything you try is too slow, but then state you need "multiple loops to execute the query". Almost by definition that is the slowest possible solution. Post a complete description of the problem, the table definition, and sample data, as text - **no images**. The community can most likely devise a better solution, but we need that information. I guess it is a matter of what you are most used to, but SQL is about the easiest thing to parameterize: name the columns and make sure the values are in the correct sequence. But that's just my opinion and experience. – Belayer Jul 12 '20 at 18:24
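For example, a minimal sketch of a parameterized, batched INSERT through SQLAlchemy (purely illustrative: the connection URL, table and column names are placeholders, and it assumes the CSV header names match the bind parameter names):

import csv
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@remote-host/dbname")

# Named bind parameters; SQLAlchemy handles quoting and escaping.
insert_stmt = text(
    "INSERT INTO address (address_id, first_name, last_name) "
    "VALUES (:address_id, :first_name, :last_name)"
)

with open("Address.csv", newline="") as f, engine.begin() as conn:
    reader = csv.DictReader(f)   # keys come from the CSV header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= 1000:                 # one round trip per 1000 rows
            conn.execute(insert_stmt, batch)   # executemany-style batch
            batch = []
    if batch:
        conn.execute(insert_stmt, batch)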
2 Answers
I was given the task of manipulating & migrating some data from CSV files into a remote Postgres Instance.
I decided to use the Python script below:
import csv
import uuid
import psycopg2
import psycopg2.extras
import time

# Instant time at the start of the script
start = time.time()

psycopg2.extras.register_uuid()

# List of CSV files that I want to manipulate & migrate.
file_list = ["Address.csv"]

conn = psycopg2.connect("host=localhost dbname=address user=postgres password=docker")
cur = conn.cursor()

i = 1
for f in file_list:
    f = open(f)
    csv_f = csv.reader(f)
    next(csv_f)
    for row in csv_f:
        # Some simple manipulations on each row
        # Inserting a uuid4 into the first column
        row.pop(0)
        row.insert(0, uuid.uuid4())
        row.pop(10)
        row.insert(10, False)
        row.pop(13)
        # Tracking the number of rows inserted
        print(i)
        i = i + 1
        # INSERT QUERY
        postgres_insert_query = """ INSERT INTO "public"."address"("address_id","address_line_1","locality_area_street","address_name","app_version","channel_type","city","country","created_at","first_name","is_default","landmark","last_name","mobile","pincode","territory","updated_at","user_id") VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
        record_to_insert = row
        cur.execute(postgres_insert_query, record_to_insert)
    f.close()

conn.commit()
conn.close()
print(time.time() - start)
The script worked quite well and promptly when I tested it locally, but connecting to a remote database server added a lot more latency.
As a workaround, I migrated the manipulated data into my local postgres instance. I then generated a .sql file of the migrated data & manually imported the .sql file on the remote server.
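If it helps, the dump-and-restore step looks roughly like this when wrapped in Python (host names, user, database and table names are placeholders, not the ones I actually used):

import subprocess

# Dump only the data of the locally migrated table to a .sql file.
subprocess.run(
    ["pg_dump", "-h", "localhost", "-U", "postgres", "-d", "address",
     "-t", "public.address", "--data-only", "-f", "address.sql"],
    check=True,
)

# Replay that .sql file against the remote server.
subprocess.run(
    ["psql", "-h", "remote-host", "-U", "postgres", "-d", "address",
     "-f", "address.sql"],
    check=True,
)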
Alternatively, you can use Python's multithreading features to open multiple concurrent connections to the remote server, dedicate an isolated batch of rows to each connection, and flush the data in parallel. This should make your migration considerably faster.
I have personally not tried the multithreading approach, as it wasn't required in my case, but it seems darn efficient.
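A rough sketch of that multi-connection idea (purely illustrative: the connection string, table and column names are placeholders). Each worker thread owns its own connection and flushes one batch of rows per execute_values() call, so the batches travel to the remote server in parallel:

import csv
import threading
from queue import Queue

import psycopg2
import psycopg2.extras

DSN = "host=remote-host dbname=address user=postgres password=docker"
INSERT_SQL = 'INSERT INTO "public"."address" ("address_id", "first_name", "last_name") VALUES %s'
BATCH_SIZE = 1000
NUM_WORKERS = 4

batches = Queue()

def worker():
    # One dedicated connection per thread.
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    while True:
        batch = batches.get()
        if batch is None:        # sentinel: no more work
            break
        # execute_values sends the whole batch in a single statement.
        psycopg2.extras.execute_values(cur, INSERT_SQL, batch)
        conn.commit()
    cur.close()
    conn.close()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# Read the CSV and hand out fixed-size batches to the workers.
with open("Address.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                 # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            batches.put(batch)
            batch = []
    if batch:
        batches.put(batch)

for _ in threads:
    batches.put(None)            # one sentinel per worker
for t in threads:
    t.join()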
Hope this helped ! :)
Resources: CSV Manipulation using Python for Beginners.

Use psycopg2's copy_from command; it copies all the rows from the file into the table with a single COPY statement.

with open('file.csv', 'r') as f:
    next(f)  # skip the header row
    cur.copy_from(f, 'table_name', sep=',', columns=('id', 'name', 'email'))
conn.commit()
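For a self-contained version, something like this should work (connection settings, file, table and column names are placeholders). If the CSV contains quoted fields, copy_expert with FORMAT csv is the safer variant, since plain copy_from does not parse quotes:

import psycopg2

conn = psycopg2.connect("host=remote-host dbname=address user=postgres password=docker")
cur = conn.cursor()

with open("file.csv", "r") as f:
    # COPY ... FROM STDIN streams the whole file in one round trip;
    # HEADER true makes the server skip the first line.
    cur.copy_expert(
        "COPY table_name (id, name, email) FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )

conn.commit()
cur.close()
conn.close()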
