1

I have two object lists: listA<modelA>(), listB<modelB>() based on the models below.

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String

Base = declarative_base()

class modelA(Base):
    __tablename__ = "TableA"

    rowID = Column(Integer, primary_key=True)
    applicationNo = Column(String)
    accountNum = Column(String)
    sanitizedAccountNum = Column(String)

class modelB(Base):
    __tablename__ = "TableB"

    rowID = Column(Integer, primary_key=True)
    applicationNo = Column(String)
    accountNum = Column(String)
    sanitizedAccountNum = Column(String)

# create SQLAlchemy engine/connection
engine = create_engine("mysql+mysqlconnector://root:usbw@localhost:3307/testDB", echo=False)
dbSession = sessionmaker(bind=engine)
session = dbSession()

# query to pull data from DB 
listA = session.query(modelA).limit(100).all()
listB = session.query(modelB).limit(100).all()

These lists are populated using SqlAlchemy. Each of the tables contains close to a million records each, hence I'm trying to perform the query on a portion of the records at a time.

After fetching data from the DB, I am trying to perform a SQL style left join on the above two lists like the SQL query below:

SELECT a.applicationNo, a.sanitizedAccountNum
FROM listA a
LEFT JOIN listB b on b.applicationNo=a.applicationNo and b.sanitizedAccountNum=a.sanitizedAccountNum
WHERE b.applicationNo IS NULL;

I have tried using Pandas' DataFrame but am unable to get the correct results.

Pandas:

dfA = pd.DataFrame(listA)
dfB = pd.DataFrame(listB)
resultPD = pd.merge(dfA, dfB, how="left"), on=["applicationNo","sanitizedAccountNum"])

The "on" clause here does not work giving me "KeyError: 'applicationNo'". How do I set the "join on" columns in the above query for my models?

Traceback:

Traceback (most recent call last):
  File "dbna.py", line 58, in <module>
    resultPD = pd.merge(dfA, dfB, indicator="i", how="left", on=["applicationNo","sanitizedAccountNum"])
  File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\reshape\merge.py", line 61, in merge validate=validate)
  File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\reshape\merge.py", line 551, in __init__ self.join_names) = self._get_merge_keys()
  File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\reshape\merge.py", line 857, in _get_merge_keys rk, stacklevel=stacklevel))
  File "C:\Users\1833\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'applicationNo'

Also, is this best way to "left join" listA and listB and get only those records from listA that are not in listB based on the two specific columns mentioned?

Edit (Sample Data): TableA Sample

TableB Sample

UPDATE:

As @Philip suggested in his comments below, the trick was to directly bind the DB result to the Pandas DataFrame instead of binding to a list of a class (model) and then creating a DataFrame from that list. This link provided by him in his comments did the trick.

Pranav S
  • 425
  • 1
  • 5
  • 14
  • 1
    Is there a reason why you cannot perform the (left) join in SQL? Having a million records in each table should not be an issue, though the query probably will take some time. If `listA` is a list of SQLAlchemy entity objects, then `dfA` will have a single object column. Please produce a [mcve] and include the full traceback. – Ilja Everilä Dec 13 '18 at 05:23
  • Tried performing it in MySQL which was taking over 40 hours and still going, hence doing it DB side is not an option. My eventual goal is to running a few threads with 100 thousand records each. Is there a way to access the columns from the dfA single object column? I have also edited my post and added Traceback for the KeyError – Pranav S Dec 13 '18 at 06:00
  • Could you post a few rows of sample data from the two tables? – Philip Dec 13 '18 at 08:23
  • Fun fact: Postgresql finishes such a query, with ~million rows in both tables, in about a second (using version 10). Am waiting for MySQL 8 to finish. – Ilja Everilä Dec 13 '18 at 10:25
  • Things that might help MySQL actually get there: make the columns used in the join `NOT NULL` and create an index on both tables that covers the columns used in the `ON` clause. With such modifications the query finished in ~5s on this machine on MySQL (with the million rows in both tables). – Ilja Everilä Dec 13 '18 at 11:52
  • Both tables have indices on the joined columns and was running on a server but it was still taking an extremely long time. Plus I wanted to learn more about python, so thought this would be a good opportunity. – Pranav S Dec 14 '18 at 05:03

1 Answers1

1

One suggestion could be that you create a view either in MySql or as a query, and then use that view with a record limit or by specifying chunksize in pandas.

Create VIEW in database:

CREATE VIEW AB_joined AS
    SELECT a.applicationNo
        ,a.sanitizedAccountNum
    FROM listA a
    LEFT JOIN listB b ON b.applicationNo = a.applicationNo
        AND b.sanitizedAccountNum = a.sanitizedAccountNum
    WHERE b.applicationNo IS NULL

and use query1 in pandas:

query1 = "SELECT * FROM AB_joined"

or use just use query2 directly in pandas:

query2 = """
SELECT a.applicationNo
        ,a.sanitizedAccountNum
    FROM listA a
    LEFT JOIN listB b ON b.applicationNo = a.applicationNo
        AND b.sanitizedAccountNum = a.sanitizedAccountNum
    WHERE b.applicationNo IS NULL"""

Then use pandas to read chunksize, do you thing and merge the different chunksizes together.

result = pd.read_sql_query(query, engine, chunksize=100000)

You can find more about pandas.read_sql_query here

Another suggestion is to create the view directly with sqlalchemy and do what you did above. In my mind choice depends on the purpose of the project. You might find inspiration to create views in sqlalchemy here

Your first question. I think the query should look like this:

resultPD = dfA.merge(dfB, left_on="applicationNo", right_on="sanitizedLoanAccount", how="left")

Your second question. A left join is the way to get only records from listA that are not in listB. You are also utilizing a where clause, which adds additional rules for which rows should be selected.

UPDATE I

I just realized that your data is stored as strings. It is not good practice to join data on string values. If possible I would suggest converting numbers stored as strings to integers. This could help avoid a lot of issues.

UPDATE II - added data

I've tried with the data you made a screen shot from. Just using two rows from each.

dfA = pd.DataFrame({
    'RowID' : [1,2],
    'ApplicationNo': ['L0008065026','L000969215'],
    'AccountNum': ['34204731277', '006737107100039'],
    'SanatizedAccountNum': ['34204731277', '6737107100039']
    }) 

dfB = pd.DataFrame({
    'RowID' : [1,2],
    'ApplicationNo': ['L43907','L52006'],
    'AccountNum': ['3265470064', '073176310000477'],
    'SanatizedAccountNum': ['3265470064', '73176310000477']
    }) 

resultPD = dfA.merge(dfB, left_on="ApplicationNo", right_on="SanatizedAccountNum", how="left")

With the above I have no problem getting a resultPD.

Philip
  • 944
  • 11
  • 26
  • When trying this suggestion of yours: `resultPD = dfA.merge(dfB, left_on="applicationNo", right_on="sanitizedAccountNum", how="left")`, I am getting a `KeyError: 'sanitizedAccountNum'`. It gives me a KeyError on right_on no matter what column name I use. – Pranav S Dec 13 '18 at 07:10
  • I need to see a few rows of the data from each table. Did you try the first suggestion? – Philip Dec 13 '18 at 08:26
  • I have added an edit the main post with the snapshots of the tables' sample data. Was not able to format it correctly in a tabular format in the comment. – Pranav S Dec 13 '18 at 09:22
  • I'm trying to work on the first suggestion; working on figuring out how to iterate through the result from it which is a "SQLDatabase._query_iterator" object. – Pranav S Dec 13 '18 at 10:18
  • So, does that mean I cannot just create a DataFrame by doing `pd.DataFrame(listA)`; I have to create a list like the one you mentioned in your update using a loop first? – Pranav S Dec 13 '18 at 10:28
  • 1
    Have a look here, I think that answers your question: [link](https://stackoverflow.com/questions/29525808/sqlalchemy-orm-conversion-to-pandas-dataframe) – Philip Dec 13 '18 at 10:32