Description
I am trying to migrate data from a pandas DataFrame into a MySQL table, but the data has some inconsistencies that I have not yet figured out how to work around. Any help would be much appreciated.
Example of the data I have:
user_type (table)
code | detail |
---|---|
a | Secretary |
b | Accountant |
user_df (DataFrame with the data I want to migrate to the user table)
id | name | user_type_code (FK: user_type) |
---|---|---|
1 | Jane Doe | a |
2 | John Doe | a |
3 | James Doe | b |
4 | Jeff Doe | c |
5 | Jennifer Doe | d |
As you can see from the data above, the user_type_code values c and d do not exist in the user_type table.
What I want is to automatically insert the missing user_type rows with placeholder details, so that every user record is kept and the placeholders can be corrected later.
user_type table (how I want it to be at the end)
code | detail |
---|---|
a | Secretary |
b | Accountant |
c | Unknown c |
d | Unknown d |
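To make the target concrete, here is a minimal pandas-only sketch of how the placeholder rows could be derived. It assumes the user_type table has already been loaded into a DataFrame; the names user_type_df and build_placeholder_user_types are made up for this example and are not part of my actual code.

```python
import pandas as pd

# Sample data copied from the tables above.
user_type_df = pd.DataFrame(
    {"code": ["a", "b"], "detail": ["Secretary", "Accountant"]}
)
user_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["Jane Doe", "John Doe", "James Doe", "Jeff Doe", "Jennifer Doe"],
        "user_type_code": ["a", "a", "b", "c", "d"],
    }
)


def build_placeholder_user_types(user_df, user_type_df):
    """Return 'Unknown <code>' rows for codes used in user_df but absent from user_type_df."""
    # Distinct, non-null codes referenced by the users.
    used = (
        user_df[["user_type_code"]]
        .dropna()
        .drop_duplicates()
        .rename(columns={"user_type_code": "code"})
    )
    # Anti-join: keep only the codes that have no match in user_type_df.
    merged = used.merge(user_type_df[["code"]], on="code", how="left", indicator=True)
    missing = merged.loc[merged["_merge"] == "left_only", "code"]
    return pd.DataFrame(
        {"code": missing, "detail": "Unknown " + missing}
    ).reset_index(drop=True)


placeholders = build_placeholder_user_types(user_df, user_type_df)
# Existing rows followed by the placeholders, i.e. the desired end state.
final_user_type_df = pd.concat([user_type_df, placeholders], ignore_index=True)
```

The open question is how to get the same effect inside the database during the migration rather than in a standalone DataFrame.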
My Current Implementation
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.dialects.mysql import insert
from sqlalchemy.exc import IntegrityError
# I want to add an implementation of inserting the dummy data in the referenced table (user_type) in this function
def insert_ignore_on_duplicates(table, conn, keys, data_iter):
    """Insert rows; on a duplicate primary key, overwrite the existing row."""
    try:
        insert_stmt = insert(table.table).values(list(data_iter))
        on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(
            insert_stmt.inserted
        )
        conn.execute(on_duplicate_key_stmt)
    except IntegrityError as error:
        # A user_type_code with no matching user_type row is reported by
        # MySQL as a foreign key violation, which SQLAlchemy raises as
        # IntegrityError (NoReferenceError is a schema-level error instead).
        print("Error: {}".format(error))
db_engine = create_engine("mysql+mysqlconnector://username:password@localhost:3306/")
user_df = pd.DataFrame()  # Assume this contains all the users' data
user_df.to_sql(
    "user",
    con=db_engine,
    if_exists="append",
    index=False,
    method=insert_ignore_on_duplicates,
    chunksize=5000,
)
I am looking for a way to modify the insert_ignore_on_duplicates function so that it automatically inserts the missing foreign key references, or for any other approach that achieves the same result.
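One "other approach" I can imagine is to seed the placeholder rows before calling to_sql, so the insert method itself does not need to change. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite engine as a stand-in for MySQL so that it runs without a server, the column types are guesses, and seed_missing_user_types is a hypothetical helper name. I have not tried this against the real 4-million-row data set.

```python
import pandas as pd
from sqlalchemy import create_engine, text


def seed_missing_user_types(conn, user_df):
    """Insert an 'Unknown <code>' row for each code in user_df that user_type lacks."""
    existing = pd.read_sql(text("SELECT code FROM user_type"), conn)["code"]
    used = user_df["user_type_code"].dropna().drop_duplicates()
    missing = used[~used.isin(existing)]
    placeholders = pd.DataFrame(
        {"code": missing, "detail": "Unknown " + missing.astype(str)}
    )
    if not placeholders.empty:
        placeholders.to_sql("user_type", conn, if_exists="append", index=False)
    return placeholders["code"].tolist()


# Assumption: in-memory SQLite stands in for the MySQL engine in this sketch.
engine = create_engine("sqlite://")

user_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["Jane Doe", "John Doe", "James Doe", "Jeff Doe", "Jennifer Doe"],
        "user_type_code": ["a", "a", "b", "c", "d"],
    }
)

with engine.begin() as conn:
    # Recreate the existing user_type table from the example data.
    conn.execute(
        text("CREATE TABLE user_type (code VARCHAR(10) PRIMARY KEY, detail VARCHAR(100))")
    )
    conn.execute(
        text(
            "INSERT INTO user_type (code, detail) "
            "VALUES ('a', 'Secretary'), ('b', 'Accountant')"
        )
    )
    # Step 1: add placeholder parents. Step 2: insert the users.
    added = seed_missing_user_types(conn, user_df)
    user_df.to_sql("user", conn, if_exists="append", index=False)
    user_types = pd.read_sql(
        text("SELECT code, detail FROM user_type ORDER BY code"), conn
    )
```

Because the parent rows exist before the users are written, the foreign key check should no longer fail. With MySQL, the second step could presumably keep method=insert_ignore_on_duplicates and chunksize=5000 as in my current implementation.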
Some Related Questions I Found
- Does SQLAlchemy have an equivalent of Django's get_or_create?
- SQLAlchemy Automatically Create Entry If Doesn't Exist As Foreign Key
- Fastest way to insert object if it doesn't exist with SQLAlchemy
P.S. I need this because the data set is large (more than 4 million records) and contains many foreign key values that are missing from the referenced tables, so checking them manually is not practical. Inserting placeholder rows keeps all the data and allows it to be corrected later, for example by updating the record c: Unknown c to c: Auditor.