I have a 10-million-row table in a MySQL database that I need to read, run some validation checks on from my client machine, and load into a table in a Postgres database. I can pull the data onto my machine successfully, but I run out of memory while trying to process it and load it into Postgres.
Is there a way to use an iterator so that I can process the data and insert it into Postgres in chunks?
Here is the code I currently have:
from sqlalchemy import create_engine, MetaData, Table
# MySQL database connection
source_engine = create_engine('mysql+pymysql://user:pwd@serveraddress:3306/dbname')
source_connection = source_engine.connect()
# Read the entire data
data = source_connection.execute('SELECT * FROM table')
# function to transform data
def transform(data):
    def process_row(row):
        """do data validation on the row"""
        return row
    # process and return the incoming dataset as a list of dicts
    processed_data = [dict(zip(data.keys(), process_row(d))) for d in data]
    return processed_data
transformed_data = transform(data)
# close the MySQL connection once the rows have been consumed
source_connection.close()
# Postgres database connection
dest_engine = create_engine('postgresql://user:pwd@serveraddress:5432/dbname')
dest_connection = dest_engine.connect()
dest_meta = MetaData(bind=dest_engine, schema='test')
table = Table('table_name', dest_meta, autoload=True)
dest_connection.execute(table.insert(), transformed_data)
dest_connection.close()
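I was imagining something along these lines (just a rough, untested sketch; the chunk size of 10000 and the 'SELECT * FROM table' / 'table_name' names are placeholders), using stream_results so pymysql reads from a server-side cursor and fetchmany() to pull the rows in batches, but I don't know if this is the right approach:
from sqlalchemy import create_engine, MetaData, Table

CHUNK_SIZE = 10000  # placeholder; tune to whatever fits in memory

source_engine = create_engine('mysql+pymysql://user:pwd@serveraddress:3306/dbname')
dest_engine = create_engine('postgresql://user:pwd@serveraddress:5432/dbname')

dest_meta = MetaData(bind=dest_engine, schema='test')
dest_table = Table('table_name', dest_meta, autoload=True)

def process_row(row):
    """do data validation on the row"""
    return row

with source_engine.connect() as source_connection, dest_engine.connect() as dest_connection:
    # stream_results tells pymysql to use a server-side cursor instead of
    # buffering all 10 million rows on the client
    result = source_connection.execution_options(stream_results=True).execute('SELECT * FROM table')
    keys = result.keys()
    while True:
        chunk = result.fetchmany(CHUNK_SIZE)
        if not chunk:
            break
        transformed = [dict(zip(keys, process_row(row))) for row in chunk]
        # insert one chunk at a time instead of the whole dataset
        dest_connection.execute(dest_table.insert(), transformed)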
Can anyone suggest a simple way to do this?