I am trying to create a small subset of my large MSSQL Server databases in a local MySQL database for analytics/ML with Python. I successfully read the data into pandas DataFrames and convert all columns with object dtype to string, so that I am only dealing with string, float64, int64 and bool dtypes. Then I write them to MySQL using DataFrame.to_sql, with SQLAlchemy to create the engine and mysql-connector-python as the driver, so everything happens in Python.
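The dtype conversion step looks roughly like this (a minimal sketch with a made-up sample frame standing in for one of the MSSQL tables):

```python
import pandas as pd

# Hypothetical sample frame; the real ones come from MSSQL
df = pd.DataFrame({"name": ["a", None], "score": [1.5, 2.0], "n": [1, 2]})

# Convert every object-dtype column to pandas string dtype so only
# string, float64, int64 and bool dtypes remain
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype("string")
```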
Problem: When writing to the local MySQL database, all tables get created, but some end up empty and others have missing rows. If I write the same DataFrames to SQLite, I don't have any of these issues. Increasing chunksize reduced the missing rows and increasing pool_recycle reduced the empty tables, but neither issue is completely resolved; in particular, a few specific tables are never populated. Analyzing the tables in MySQL shows utf8mb4_0900_ai_ci as the table collation, which basically means they use the utf8mb4 character set. The tables are not huge (the biggest is less than 200K rows and 50 columns) and the process doesn't take long, less than 10 minutes to read and write all 22 tables.
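For comparison, this is roughly the SQLite version that works reliably for me (using an in-memory database and a made-up frame here just to keep the example self-contained; my actual test writes to a file):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine; the real test uses a file-based database
engine = create_engine('sqlite://')

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, 2.0, 3.0]})
df.to_sql('demo', engine, if_exists='replace', index=True,
          chunksize=5000, index_label='Index')

# Reading back confirms no rows are lost
check = pd.read_sql('SELECT COUNT(*) AS n FROM demo', engine)
print(check['n'][0])  # matches len(df)
```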
Code:
import gc

import pandas as pd
import pyodbc
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
import mysql.connector


def update_mysql_db(local_db_name, tables_to_update):
    '''Receives a list of pairs in the form of
    (table name, table data in a dataframe).
    Creates new tables in the local database with the data provided.'''
    print(f"{len(tables_to_update)} tables will be updated/created")
    for table_name, table_data in tables_to_update:
        print(f"\nreading {table_name} dataframe")
        try:
            table_data.to_sql(table_name.lower(),
                              create_engine('mysql+mysqlconnector://demo:Demo@localhost:3306/sqlalchemy',
                                            pool_recycle=36000, echo=False),
                              if_exists='replace',
                              index=True,
                              chunksize=5000,
                              index_label='Index')
            print(f'Table {local_db_name}.{table_name.lower()} Created')
        except ValueError as vx:
            print('Value Error : \n\n ', vx)
        except Exception as ex:
            print('Exception : \n\n ', ex)
        gc.collect()
    print("\nFinished updating the local database")
    return None


def main():
    local_db = 'sqlalchemy'
    update_mysql_db(local_db, list_of_tablesnames_and_data)
    gc.collect()


if __name__ == "__main__":
    main()
Errors I get:
reading SeminarEvaluationResponses dataframe
Exception during reset or similar
Traceback (most recent call last):
File "C:\Sam\Anaconda\lib\site-packages\sqlalchemy\pool\base.py", line 693, in _finalize_fairy
fairy._reset(pool)
File "C:\Sam\Anaconda\lib\site-packages\sqlalchemy\pool\base.py", line 880, in _reset
pool._dialect.do_rollback(self)
File "C:\Sam\Anaconda\lib\site-packages\sqlalchemy\dialects\mysql\base.py", line 2302, in do_rollback
dbapi_connection.rollback()
File "C:\Sam\Anaconda\lib\site-packages\mysql\connector\connection_cext.py", line 386, in rollback
self._cmysql.rollback()
_mysql_connector.MySQLInterfaceError: MySQL server has gone away
Exception :
MySQL server has gone away
What else: Unfortunately I can't upload any actual data/databases/tables here to make the code runnable. I am hoping some of you gurus out there can give me a couple of hints or point me in the right direction.