
I am reading from a Microsoft SQL Server instance. I need to read all data from a table which is quite big (~4 million records), so I'd like to do that in chunks to limit the memory usage of my Python program.

This works fine normally, but now I need to move where this runs, which forces it to go over a not-so-stable connection (I believe the VPN is sometimes throttling it). So occasionally I get a connection error in one of the chunks:

sqlalchemy.exc.OperationalError: (pyodbc.OperationalError) ('08S01', '[08S01] [Microsoft][ODBC Driver 17 for SQL Server]TCP Provider: Error code 0x68 (104) (SQLGetData)')

The code I run comes down to this:

import pandas as pd
from sqlalchemy import create_engine

connection_string = 'mssql+pyodbc://DB_USER:DB_PASSWORD@DB_HOST/DB_NAME?trusted_connection=no&driver=ODBC+Driver+17+for+SQL+Server'
db = create_engine(connection_string, pool_pre_ping=True)

query = 'SELECT * FROM table'

for chunk in pd.read_sql_query(query, db, chunksize=500_000):
    pass  # do stuff with chunk

What I would like to know: is it possible to add a retry mechanism that can continue with the correct chunk if the connection fails? I've tried a few options, but none of them seem to be able to recover and continue at the same chunk.
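
Roughly the kind of thing I've tried, reusing query and db from above (the retry count and sleep time are arbitrary). The problem is that when the connection drops, read_sql_query starts the iteration over from the first chunk instead of resuming:

import time

for attempt in range(5):
    try:
        for chunk in pd.read_sql_query(query, db, chunksize=500_000):
            pass  # do stuff with chunk
        break  # all chunks processed
    except Exception as exc:  # in practice sqlalchemy.exc.OperationalError
        print(f'connection dropped ({exc}), retrying from the start...')
        db.dispose()  # discard the broken pooled connection
        time.sleep(5)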

arno_v
  • Sorry, I hadn't quite read the question well. One thing you could do is write every chunk to disk as a parquet file. Using a try/except, re-read the database excluding the primary ids already in the parquet, that's if you know the primary key. Parquet will save space if compressed, and filtering by primary key will be even easier. – wwnde Apr 01 '22 at 11:37
  • And you don't have to write the entire file, just the primary keys. These could even be written back to the database as a temp table (a rough sketch of this idea follows below the comments). Let me know if this is an option and I can help with the code. – wwnde Apr 01 '22 at 11:44
  • It is an option to manually iterate over the set (with LIMIT / OFFSET) and keep track of the unique IDs, but that is also annoying, since I actually need this for quite a number of different queries (sometimes including subqueries). – arno_v Apr 01 '22 at 12:29
  • I hear you. A durable solution, if this is a common occurrence, is to venture into Azure/Databricks. The two are scalable and offer solutions for reading from on-prem. – wwnde Apr 01 '22 at 12:36
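
A rough sketch of the approach described in the comments, assuming the table has a single integer primary key (called id here; the checkpoint file name and chunk size are placeholders as well). To keep it short, the checkpoint stores only the highest id processed so far rather than every key:

import os
import time

import pandas as pd
from sqlalchemy import create_engine

CHECKPOINT = 'last_id.parquet'  # placeholder checkpoint file
CHUNK_SIZE = 500_000

connection_string = ('mssql+pyodbc://DB_USER:DB_PASSWORD@DB_HOST/DB_NAME'
                     '?trusted_connection=no&driver=ODBC+Driver+17+for+SQL+Server')
db = create_engine(connection_string, pool_pre_ping=True)

# Resume point: the highest id already processed, if a previous run got that far.
last_id = pd.read_parquet(CHECKPOINT)['id'].iloc[0] if os.path.exists(CHECKPOINT) else -1

while True:
    query = (f'SELECT * FROM table WHERE id > {last_id} '
             f'ORDER BY id OFFSET 0 ROWS FETCH NEXT {CHUNK_SIZE} ROWS ONLY')
    try:
        chunk = pd.read_sql_query(query, db)
    except Exception as exc:  # e.g. sqlalchemy.exc.OperationalError
        print(f'chunk after id {last_id} failed ({exc}), retrying...')
        db.dispose()          # drop the broken pooled connection
        time.sleep(5)
        continue              # re-request the same chunk
    if chunk.empty:
        break                 # all rows processed
    # do stuff with chunk, then record how far we got
    last_id = int(chunk['id'].max())
    pd.DataFrame({'id': [last_id]}).to_parquet(CHECKPOINT)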

2 Answers


query = 'SELECT * FROM table' is bad practice: always filter by the fields you need, and process in chunks of 500 records (https://www.w3schools.com/sql/sql_top.asp):

SELECT TOP number|percent column_name(s) FROM table_name WHERE condition;
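
Applied to the asker's setup, that could look something like the following; id, col1 and col2 are placeholder column names, and TOP needs an ORDER BY on a unique column to page deterministically:

import pandas as pd

# db is the SQLAlchemy engine from the question; id/col1/col2 are placeholders.
last_id = -1
while True:
    query = ('SELECT TOP 500 id, col1, col2 FROM table '
             f'WHERE id > {last_id} ORDER BY id')
    batch = pd.read_sql_query(query, db)
    if batch.empty:
        break  # no rows left
    # do stuff with batch
    last_id = int(batch['id'].max())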

Susmita Mitra

I feel your pain. My VPN is the same. I'm not sure if this is a viable solution for you, but you can try this technique.

import time
import pyodbc

# machineid, counter and effectively come from the surrounding application
cnxn = pyodbc.connect('DRIVER=FreeTDS;SERVER=*;PORT=*;DATABASE=*;UID=*;PWD=*;TDS_Version=8.7;', autocommit=True)

retry_flag = True
retry_count = 0
cursor = cnxn.cursor()
while retry_flag and retry_count < 5:
    try:
        cursor.execute('SELECT too_id FROM [TTMM].[dbo].[Machines] WHERE MachineID = {}'.format(machineid))
        too_id = cursor.fetchone()[0]
        cursor.execute('INSERT INTO [TTMM].[dbo].[{}](counter, effectively, too_id) VALUES ({},{},{})'.format(machineid, counter, effectively, too_id))
        retry_flag = False
        print("Printed To DB - Counter = ", counter, ", Effectively = ", effectively, ", too_id = ", too_id)

    except Exception as e:
        print(e)
        print("Retry after 5 sec")
        retry_count = retry_count + 1
        # tear the dead connection down, wait, and reconnect before retrying
        cursor.close()
        cnxn.close()
        time.sleep(5)
        cnxn = pyodbc.connect('DRIVER=FreeTDS;SERVER=*;PORT=*;DATABASE=*;UID=*;PWD=*;TDS_Version=8.7;', autocommit=True)
        cursor = cnxn.cursor()

cursor.close()

How to retry after sql connection failed in python?

ASH