
I am reading tables as `SELECT * FROM TABLE` (the `sql` variable below) from an ODBC data source via pyodbc and fetching/loading all the rows using pandas read_sql(). However, there are 200+ tables and some have 100,000 rows, so I have been using chunksize to read and load them into dataframes to gain some read performance.

Below is a code sample:

def get_odbc_tables(dsn, uid, pwd):
    try:
        cnxn = pyodbc.connect('DSN={};UID={};PWD={}'.format(dsn, uid, pwd), autocommit=True)
        # Get data into pandas dataframe
        dfl = []
        df = pd.DataFrame()
        for chunk in pd.read_sql(sql, cnxn, chunksize=10000):
            dfl.append(chunk)
            df = pd.concat(dfl, ignore_index=True)
            records = json.loads(df.T.to_json()).values()
            print("Load to Target")
            ......
            cnxn.close()
    except Exception as e:
        print("Error: {}".format(str(e)))
        sys.exit(1)

However, I always get this error after pandas has read/processed the first chunk of the specified chunksize (10,000) defined in read_sql and loaded it to the target:

Error: The cursor's connection has been closed

If chunksize is increased to 50,000, it errors out again with the same message once it has processed/loaded just 50,000 records, even though the source has more records than that. This also causes the program to fail.

C:\Program Files (x86)\Python\lib\site-packages\pandas\io\sql.py in _query_iterator(cursor, chunksize, columns, index_col, coerce_float, parse_dates)
   1419         while True:
-> 1420             data = cursor.fetchmany(chunksize)
   1421             if type(data) == tuple:

ProgrammingError: The cursor's connection has been closed.

During handling of the above exception, another exception occurred:

SystemExit                                Traceback (most recent call last)
<ipython-input-127-b106daee9737> in <module>()

Please suggest if there's any way to handle this. The source is an ODBC data source connection only, so I don't think I can create an SQLAlchemy engine for it.

  • Move the `cnxn.close()` outside the `for` loop (see the corrected sketch after these comments). – Gord Thompson Jul 24 '20 at 19:53
  • Aah!! My bad, such a silly mistake... Thanks for pointing that out. Any opinion or suggestions on the fastest way to read an entire set of tables from an ODBC data source? I have evaluated both pyodbc and turbodbc, but reads were almost the same on a table with 100,000 records. The target would be MongoDB, though. – ManiK Jul 24 '20 at 20:14
  • 1
    First, [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying](https://stackoverflow.com/a/36489724/1422451). Second, avoid the heavy library called `pandas`. Use it for data analytics not migration. You can easily build list of dicts or dicts of lists (i.e., json) from pyodbc cursors. Please explain fully your desired needs. – Parfait Jul 24 '20 at 21:51
  • Thanks for the suggestion. I did a quick evaluation against the current method by exporting pyodbc cursor results as dicts: https://stackoverflow.com/questions/16519385/output-pyodbc-cursor-results-as-python-dictionary/16523148#16523148. There was an improvement of 2-3 minutes in reading one of the large tables. Appreciate it!! – ManiK Jul 25 '20 at 11:19
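
A minimal sketch of the fix suggested in the comments, using the same variable names as the sample above: the connection is closed only after the chunk iterator is exhausted, and the chunks are concatenated once outside the loop (which also addresses Parfait's first point). Passing `sql` in as a parameter is my adjustment for self-containment; the actual load-to-target step stays elided as in the question.

import json
import sys

import pyodbc
import pandas as pd

def get_odbc_tables(dsn, uid, pwd, sql):
    try:
        cnxn = pyodbc.connect('DSN={};UID={};PWD={}'.format(dsn, uid, pwd), autocommit=True)
        dfl = []
        for chunk in pd.read_sql(sql, cnxn, chunksize=10000):
            dfl.append(chunk)                     # only collect the chunks inside the loop
        cnxn.close()                              # close after the iterator is exhausted
        df = pd.concat(dfl, ignore_index=True)    # concatenate once, outside the loop
        records = json.loads(df.T.to_json()).values()
        print("Load to Target")
        # ... load `records` to the target here, as in the question
    except Exception as e:
        print("Error: {}".format(str(e)))
        sys.exit(1)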
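
For the pandas-free route Parfait and the linked answer describe, here is a sketch that builds each batch as a list of dicts straight from the pyodbc cursor; that shape can be passed to a MongoDB collection's insert_many() without going through a DataFrame. The function name, batch size, and the commented-out `mongo_collection` are placeholders of mine, not from the question.

def stream_table_as_dicts(cnxn, sql, batch_size=10000):
    # Yield batches of rows as lists of dicts, keyed by the column
    # names taken from cursor.description.
    cursor = cnxn.cursor()
    cursor.execute(sql)
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]
    cursor.close()

# Usage, e.g. with a MongoDB target:
# for batch in stream_table_as_dicts(cnxn, sql):
#     mongo_collection.insert_many(batch)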

0 Answers