
Is ≈105 seconds per 1 million rows slow or fast for inserting into a local PostgreSQL database, on a table with 2 indexes and 4 columns?

Python Code:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

# host, tbl_name and data_path are defined elsewhere in the script
num = 32473068      # total number of rows in the CSV
batch = 1000000     # rows read from the CSV per outer iteration

def main(data):
    engine = create_engine('postgresql://***:****@' + host + ':5432/kaggle')
    data.to_sql(con=engine, name=tbl_name, if_exists='append', index=False)

for i in range(0, num, batch):
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    batchSize = 10000   # rows per to_sql call
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True)
                 for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        futures = [executor.submit(main, d) for d in batchList]
        for future in as_completed(futures):
            future.result()  # re-raise any exception from a worker thread
  • Postgres has a special command for importing CSV files, `COPY` - nothing should be faster. – Pavel Stehule Jul 14 '16 at 09:05
  • pg_bulkload is actually faster – d1ll1nger Jul 14 '16 at 09:27
  • In addition to @user2189731's excellent point about enabling **use_batch_mode=True** when instantiating your SQLAlchemy engine, I would suggest that you skip multi-threading for this task. That's unlikely to provide any beneficial concurrency and probably loses far more to lock contention than you'd gain. – Jim Dennis Jan 02 '18 at 02:12
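
The `COPY` route from the first comment can also be driven from Python. Below is a minimal sketch using psycopg2's copy_expert; the connection settings and file path are placeholders, not values from the question:

import psycopg2

# Placeholder connection settings; substitute your own credentials.
conn = psycopg2.connect(host='localhost', port=5432, dbname='kaggle',
                        user='user', password='password')

copy_sql = """
    COPY app_events (event_id, app_id, is_installed, is_active)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

with conn, conn.cursor() as cur:
    # Stream the CSV straight into Postgres, bypassing per-row INSERT statements.
    with open('app_events.csv') as f:
        cur.copy_expert(copy_sql, f)
conn.close()

pg_bulkload, mentioned in the second comment, is an external loader that writes to the table files directly and can be faster still, but it runs outside Python.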

1 Answer


It depends on your hardware too. As a reference, my old i5 laptop with an HDD takes ~300 s to insert 0.1M rows (roughly 200-300 megabytes).

I learned from other similar questions that splitting large inserts into smaller batches when using the insert() command can speed things up. Since you're using Pandas, I assume it already does some of this optimization, but I suggest you run a quick test to see whether it helps too.
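
As a rough illustration of that kind of batching with the tools already in the question, to_sql accepts a chunksize parameter so pandas sends the rows in smaller groups; the connection string and table name below are placeholders:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/kaggle')  # placeholder credentials

df = pd.read_csv('app_events.csv', nrows=1000000)
# Insert in groups of 10,000 rows instead of one huge statement batch.
df.to_sql('app_events', con=engine, if_exists='append', index=False, chunksize=10000)

Whether this helps depends on where the time goes; it mostly bounds memory use and the size of each round trip, so it is worth timing against the unchunked call.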

  • Pandas actually uses a non-optimized INSERT command. See (to_sql + sqlalchemy + copy from + postgresql engine?). So bulk inserts or other methods should be used to improve performance.

  • SQLAlchemy 1.2 uses bulk inserts when you initialize your engine with the "use_batch_mode=True" parameter. I saw a 100x speedup on my i5 + HDD laptop! Meaning with 0.1M records, it originally took me 300 s and now it takes 3 s. If your computer is better than mine, I bet you could see this tremendous speedup with your 1M records. (See the sketch after this list.)
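
A minimal sketch of that engine setup under SQLAlchemy 1.2 with the psycopg2 driver (credentials and file names are placeholders); use_batch_mode=True switches executemany calls over to psycopg2's batched helper:

import pandas as pd
from sqlalchemy import create_engine

# use_batch_mode=True is the SQLAlchemy 1.2 flag referred to above;
# later releases replace it with the executemany_mode parameter.
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost:5432/kaggle',  # placeholder credentials
    use_batch_mode=True,
)

df = pd.read_csv('app_events.csv', nrows=100000)
df.to_sql('app_events', con=engine, if_exists='append', index=False)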

user2189731