
I am trying to create a small subset of my large MSSQL Server databases on my local disk in MySQL for analytics/ML using Python. I successfully read the data into pandas DataFrames and convert all columns with the object dtype to string, so that I am only dealing with string, float64, int64 and bool data types. Then I write them to MySQL using DataFrame.to_sql. I use SQLAlchemy to create the engine and mysql-connector-python as the driver, so everything happens in Python.

Problem: When writing to the local MySQL database, all tables get created, but some tables end up empty and some have missing rows. If I write the same DataFrames to SQLite, I don't have any of these issues. Increasing chunksize reduced the missing rows and increasing pool_recycle reduced the empty tables, but neither change completely resolved the issue. In particular, a few specific tables are never populated. Inspecting the tables in MySQL shows utf8mb4_0900_ai_ci as the table collation, which means they use the utf8mb4 character set. The tables are not huge; the biggest is under 200K rows and 50 columns. The process doesn't take long, less than 10 minutes to read and write all 22 tables.

Code:

import gc

import pandas as pd
from sqlalchemy import create_engine  # driver: mysql-connector-python must be installed


def update_mysql_db(local_db_name: str,
                    tables_to_update: list):
    '''Receives a list of (table name, DataFrame) pairs and
    creates/replaces the corresponding tables in the local MySQL database.'''

    # Build the engine once and reuse it for every table instead of
    # creating a new one on each iteration.
    engine = create_engine(
        'mysql+mysqlconnector://demo:Demo@localhost:3306/sqlalchemy',
        pool_recycle=36000, echo=False)

    print(f"{len(tables_to_update)} tables will be updated/created")
    for table_name, table_data in tables_to_update:
        print('\n')
        print(f"reading {table_name} dataframe")

        try:
            table_data.to_sql(table_name.lower(),
                              engine,
                              if_exists='replace',
                              index=True,        # a real bool, not the string 'True'
                              chunksize=5000,
                              index_label='Index')

            print('Table {}.{} Created'.format(local_db_name, table_name.lower()))
        except ValueError as vx:
            print('Value Error : \n\n  ', vx)
        except Exception as ex:
            print('Exception : \n\n  ', ex)

    gc.collect()
    print('\n')
    print("Finished updating the local database")
    return None


def main():
    local_db = 'sqlalchemy'
    # list_of_tablesnames_and_data is built elsewhere as (table name, DataFrame) pairs
    update_mysql_db(local_db, list_of_tablesnames_and_data)
    gc.collect()


if __name__ == "__main__":
    main()

Errors I get:

reading SeminarEvaluationResponses dataframe
Exception during reset or similar
Traceback (most recent call last):
  File "C:\Sam\Anaconda\lib\site-packages\sqlalchemy\pool\base.py", line 693, in _finalize_fairy
    fairy._reset(pool)
  File "C:\Sam\Anaconda\lib\site-packages\sqlalchemy\pool\base.py", line 880, in _reset
    pool._dialect.do_rollback(self)
  File "C:\Sam\Anaconda\lib\site-packages\sqlalchemy\dialects\mysql\base.py", line 2302, in do_rollback
    dbapi_connection.rollback()
  File "C:\Sam\Anaconda\lib\site-packages\mysql\connector\connection_cext.py", line 386, in rollback
    self._cmysql.rollback()
_mysql_connector.MySQLInterfaceError: MySQL server has gone away
Exception : 

   MySQL server has gone away

What Else: Unfortunately I can't upload any actual data/databases/tables here to make the code reproducible. I am hoping some of you gurus out there can give me a couple of hints or point me in the right direction.

Meisam H

1 Answer


It looks like the process might be taking too long and timing out the connection. Based on the time frame you've given for how long it takes to write this data, that seems to be the case. There are two options, and they aren't mutually exclusive, so you can do both if needed. The first is to increase the wait_timeout for your MySQL server. You can do this in Python or in MySQL Workbench, and this answer tells you how to do it both ways.
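
A minimal sketch of the Python route, assuming the same connection URL as in the question. Changing the GLOBAL variables affects new connections (which is what the pool hands to .to_sql()) and requires the SUPER or SYSTEM_VARIABLES_ADMIN privilege:

from sqlalchemy import create_engine, text

engine = create_engine('mysql+mysqlconnector://demo:Demo@localhost:3306/sqlalchemy')

with engine.connect() as conn:
    # See what the server is currently using
    print(conn.execute(text("SHOW VARIABLES LIKE 'wait_timeout'")).fetchall())
    print(conn.execute(text("SHOW VARIABLES LIKE 'max_allowed_packet'")).fetchall())
    # Raise the server-wide idle and network timeouts so new connections pick them up
    conn.execute(text("SET GLOBAL wait_timeout = 36000"))
    conn.execute(text("SET GLOBAL net_read_timeout = 3600"))
    conn.execute(text("SET GLOBAL net_write_timeout = 3600"))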

The second option, which is a little more involved, is to switch from SQLAlchemy to using pyodbc directly for the insert operations. That isn't faster in itself, but it lets you turn the DataFrame you're trying to push into a list of tuples, which is faster to insert with a parameterized query than pushing a DataFrame with .to_sql(). There may be some additional coding for things that SQLAlchemy handles automatically, but it's an option for improving performance. That said, I'd highly recommend trying the first option by itself to make sure this is even necessary before incorporating this strategy.
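
A rough sketch of that second option, assuming a MySQL ODBC driver is installed and the target table already exists; the driver name, df and my_table are placeholders, not names from the question:

import pyodbc

conn = pyodbc.connect(
    'DRIVER={MySQL ODBC 8.0 Unicode Driver};'   # assumed driver name, adjust to yours
    'SERVER=localhost;DATABASE=sqlalchemy;UID=demo;PWD=Demo')
cursor = conn.cursor()

# Turn the DataFrame into a list of plain tuples
rows = list(df.itertuples(index=False, name=None))

# Build one parameterized INSERT and push all rows with a single executemany call
placeholders = ', '.join('?' * len(df.columns))
columns = ', '.join(df.columns)
cursor.executemany(
    f"INSERT INTO my_table ({columns}) VALUES ({placeholders})", rows)

conn.commit()
cursor.close()
conn.close()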

For the missing rows, I don't really know why a set of records would only be partially inserted into a table. Is autocommit set to True anywhere?
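
A quick way to check, reusing the engine from the sketch above (MySQL reports 1 when autocommit is on for the session):

with engine.connect() as conn:
    # 1 = autocommit on, 0 = autocommit off for this connection
    print(conn.execute(text("SELECT @@autocommit")).scalar())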

Lastly, this answer covers a lot of ground in case I didn't accurately assess the issue.

LTheriault
  • Thanks @LTheriault. It worked pretty well and solved most of my problems. Using the first method, I increased the connection/read/write and packet read/write timeouts to 36000, then increased `max_allowed_packet` to 32M using Workbench. When keeping `chunksize` in `to_sql()` below 3,000, I have no empty tables and the number of missing rows is drastically reduced, but they are not completely gone; they show up even in tiny tables with 1,000 rows and 10 columns. Tuning the cache size, timeouts and `chunksize` doesn't seem to help any further. Could something else be in play, data type/length? – Meisam H Apr 21 '20 at 17:32
  • I think you're on the right track and I'd check data types next too if I were in your position. Character length requirements on string columns are frequent suspects for that kind of thing, but validating all of your columns before you push data saves headaches and gives you a good space to put error-handling/logging for debugging in the future. – LTheriault Apr 21 '20 at 18:36
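
A minimal sketch of the kind of pre-push check suggested above, assuming df is one of the DataFrames and 255 stands in for the actual VARCHAR limits (both are placeholders):

import pandas as pd

def report_long_strings(df: pd.DataFrame, max_len: int = 255) -> None:
    """Print string columns whose longest value exceeds max_len characters."""
    for col in df.select_dtypes(include=['object', 'string']).columns:
        longest = int(df[col].astype(str).str.len().max())
        if longest > max_len:
            print(f"{col}: longest value is {longest} characters (> {max_len})")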