I have two object lists: listA<modelA>(), listB<modelB>()
based on the models below.
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
Base = declarative_base()
class modelA(Base):
__tablename__ = "TableA"
rowID = Column(Integer, primary_key=True)
applicationNo = Column(String)
accountNum = Column(String)
sanitizedAccountNum = Column(String)
class modelB(Base):
__tablename__ = "TableB"
rowID = Column(Integer, primary_key=True)
applicationNo = Column(String)
accountNum = Column(String)
sanitizedAccountNum = Column(String)
# create SQLAlchemy engine/connection
engine = create_engine("mysql+mysqlconnector://root:usbw@localhost:3307/testDB", echo=False)
dbSession = sessionmaker(bind=engine)
session = dbSession()
# query to pull data from DB
listA = session.query(modelA).limit(100).all()
listB = session.query(modelB).limit(100).all()
These lists are populated using SqlAlchemy. Each of the tables contains close to a million records each, hence I'm trying to perform the query on a portion of the records at a time.
After fetching data from the DB, I am trying to perform a SQL style left join on the above two lists like the SQL query below:
SELECT a.applicationNo, a.sanitizedAccountNum
FROM listA a
LEFT JOIN listB b on b.applicationNo=a.applicationNo and b.sanitizedAccountNum=a.sanitizedAccountNum
WHERE b.applicationNo IS NULL;
I have tried using Pandas' DataFrame but am unable to get the correct results.
Pandas:
dfA = pd.DataFrame(listA)
dfB = pd.DataFrame(listB)
resultPD = pd.merge(dfA, dfB, how="left"), on=["applicationNo","sanitizedAccountNum"])
The "on" clause here does not work giving me "KeyError: 'applicationNo'". How do I set the "join on" columns in the above query for my models?
Traceback:
Traceback (most recent call last):
File "dbna.py", line 58, in <module>
resultPD = pd.merge(dfA, dfB, indicator="i", how="left", on=["applicationNo","sanitizedAccountNum"])
File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\reshape\merge.py", line 61, in merge validate=validate)
File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\reshape\merge.py", line 551, in __init__ self.join_names) = self._get_merge_keys()
File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\reshape\merge.py", line 857, in _get_merge_keys rk, stacklevel=stacklevel))
File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'applicationNo'
Also, is this best way to "left join" listA and listB and get only those records from listA that are not in listB based on the two specific columns mentioned?
Edit (Sample Data): TableA Sample
UPDATE:
As @Philip suggested in his comments below, the trick was to directly bind the DB result to the Pandas DataFrame instead of binding to a list of a class (model) and then creating a DataFrame from that list. This link provided by him in his comments did the trick.