
I am trying to use pandas' df.to_sql and sqlite3 in Python to put about 2 GB of data with about 16 million rows into a database. My strategy has been to chunk the original CSV into smaller dataframes, perform some operations on them, and then throw them into an SQL database.

As I run this code, it starts out fast but quickly slows down. After about 3 million rows it slows down to the point where it doesn't seem like it will finish in any realistic amount of time. What is causing this, and what can I do about it? My code is below:

import os
import sqlite3
import datetime as dt
import pandas as pd

def chunk_read_CSV_to_db(database, table, filepath, chunksize, delimiter=','):
    start = dt.datetime.now()
    conn = sqlite3.connect(database)
    index_start = 1
    j = 0
    # Read the CSV in chunks and append each chunk to the SQLite table.
    for df in pd.read_csv(filepath, chunksize=chunksize, iterator=True, encoding='utf-8', sep=delimiter):
        j += 1
        print '{} seconds: complete {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize)
        df.to_sql(name=table, con=conn, flavor='sqlite', if_exists='append')
    conn.close()

db_name = 'store_data.db'
f9 = 'xrf_str_geo_ta4_1511.txt'
chunksize = 20000
chunk_read_CSV_to_db(os.path.join(fp, db_name), os.path.splitext(f9)[0], os.path.join(fp, f9), chunksize=chunksize, delimiter='\t')
– Alexander Moore
    Do you have any indexes in your table? It'd be wise to remove the indexes, add the data and create the indexes afterwards. See http://stackoverflow.com/questions/788568/sqlite3-disabling-primary-key-index-while-inserting – Josep Valls Jan 12 '16 at 05:44
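
For reference, a minimal sketch of the drop-then-recreate approach suggested in the comment above, assuming an index already exists on the table (the index name idx_geo_id is hypothetical; the table and column names are borrowed from the second answer below):

import sqlite3

conn = sqlite3.connect('store_data.db')

# Drop the (hypothetical) index before the bulk insert so SQLite doesn't
# have to maintain it on every appended chunk.
conn.execute('DROP INDEX IF EXISTS idx_geo_id')

# ... run the chunked df.to_sql(..., if_exists='append') inserts here ...

# Recreate the index once, after all rows have been inserted.
conn.execute('CREATE INDEX idx_geo_id ON xrf_str_geo_ta4_1511 (GEO_ID)')
conn.commit()
conn.close()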

2 Answers


Solution: set the df.to_sql argument index=False

I know this answer will no longer be relevant to the original author, but I stumbled across this question because I had exactly the same problem and wanted to share my solution.

I was trying to load ~900 .csv files into an SQL database one by one, using the append method. The loading started fast, but slowed down exponentially and never finished running. This made me suspect something was going wrong with indexing (i.e. pandas was somehow re-indexing things every time I appended data), because that was the only thing I could think of to explain the slowdown (memory seemed to be fine).

Eventually I started using the sqlite3 .indexes and .dbinfo dot-commands on the command line to compare databases created through pandas with ones created through sqlite3 directly. What I found is that the pandas-created databases had 1 index, compared to 0 when created through sqlite3, and the schema size was also much bigger.
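
(The same check can be done from Python rather than the sqlite3 shell, since every index is listed in the sqlite_master table. A minimal sketch, assuming the store_data.db file from the question:)

import sqlite3

conn = sqlite3.connect('store_data.db')
# List every index SQLite knows about and the table it belongs to.
for name, tbl in conn.execute("SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"):
    print '{} (on table {})'.format(name, tbl)
conn.close()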

Now, pandas' to_sql method has an index argument. The documentation says this argument simply adds the dataframe index as a column in the database table (which sounds innocuous enough). But it turns out pandas also uses that column as a database index, and it seems like, if you're using the append method, it maybe recalculates this index every time (or something). Regardless, when I set index=False, .dbinfo showed 0 indexes in the resulting database and my problem disappeared: all the data was processed in a very short time.

So the solution would be:

df.to_sql(name=table, con=conn, flavor='sqlite', if_exists='append', index=False)
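
If you want to verify this on your own database, a quick sketch is to write a small frame both ways and count the indexes SQLite reports. The table names demo_withindex and demo_noindex are throwaway names, and the one-index-per-pandas-table behaviour is the one described above (it may differ across pandas versions):

import sqlite3
import pandas as pd

conn = sqlite3.connect('store_data.db')
df = pd.DataFrame({'a': range(5)})

# Default index=True: pandas also writes the dataframe index column.
df.to_sql('demo_withindex', conn, if_exists='replace')
# index=False: only the data columns are written, no extra index.
df.to_sql('demo_noindex', conn, if_exists='replace', index=False)

for t in ('demo_withindex', 'demo_noindex'):
    n = conn.execute("SELECT COUNT(*) FROM sqlite_master WHERE type = 'index' AND tbl_name = ?", (t,)).fetchone()[0]
    print '{}: {} index(es)'.format(t, n)

conn.close()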
– Sean Murphy

I switched over to SQLAlchemy and had no timing problems after that; there is no noticeable slowdown. The code is below.

import os
import datetime as dt
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, Numeric

def chunk_read_CSV_to_db(database, table, filepath, chunksize, delimiter=',', index=False):
    start = dt.datetime.now()
    index_start = 1
    j = 0
    # Read the CSV in chunks and append each chunk via the SQLAlchemy engine.
    for df in pd.read_csv(filepath, chunksize=chunksize, iterator=True, encoding='utf-8', sep=delimiter):
        j += 1
        print '{} seconds: complete {} rows'.format((dt.datetime.now() - start).seconds, j * chunksize)
        df.to_sql(table, database, flavor='sqlite', if_exists='append', index=index)

# SQLAlchemy engine and an explicit schema definition for the target table.
db = create_engine('sqlite:///store_data.db')
meta = MetaData(bind=db)

table_pop = Table('xrf_str_geo_ta4_1511', meta,
    Column('TDLINX', Integer, nullable=True),
    Column('GEO_ID', Integer, nullable=True),
    Column('PERCINCL', Numeric, nullable=True)
)

chunksize = 20000
chunk_read_CSV_to_db(db, 'xrf_str_geo_ta4_1511', os.path.join(fp, f9), chunksize=chunksize, delimiter='\t')
– Alexander Moore