
I have a Python process that uses SQLAlchemy to insert some data into a MS SQL Server DB. When the Python process runs it hangs during the insert. I turned on SQLAlchemy logging to get some more information. I found that it hangs at this point where SQLAlchemy seems to be requesting table schema info about the entire DB:

2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')

I have other "stuff" going on in the DB at this time, including some open transactions, and my guess is that for whatever reason querying [INFORMATION_SCHEMA].[TABLES] causes a deadlock or blocks somehow.

I've also read (here) that [INFORMATION_SCHEMA].[TABLES] is a view that cannot cause a deadlock which would contradict my guess of what is causing this issue.

My question is: Can I alter the configuration/settings of SQLAlchemy so that it does not make this query in the first place?

UPDATE 1: The Python code for the insert is like this:

with sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).connect() as connection:
    # df is a Pandas DataFrame
    df.to_sql(name=my_table, con=connection, if_exists='append', index=False)

Note that the code works without any problems when I run the Python script during other times of the day when I don't have those other DB transactions going on. In those cases, the log continues immediately like this, listing all the tables in the DB:

2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._init_metadata(result.py:810) DEBUG: Col ('TABLE_NAME',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table1',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table2',)
...

UPDATE 2: Apparently when a table or other object is created in an open transaction and not committed yet, querying [INFORMATION_SCHEMA].[TABLES] will get blocked (source). Is anyone familiar with the internals of SQLAlchemy to suggest how to prevent it from making this query in the first place?

UPDATE 3: After posting this issue on the SQLAlchemy github (issue link) the SQLAlchemy devs confirmed that the query of [INFORMATION_SCHEMA].[TABLES] is in fact being caused by the Pandas function to_sql().

So, my new question is: does anyone know how to disable this behavior in the Pandas to_sql() function? I looked over the documentation and could not find anything that seems to help.

Joe
  • Possibly related: https://dba.stackexchange.com/q/164429/21390 – Gord Thompson Oct 30 '20 at 17:53
  • Related: https://stackoverflow.com/questions/59636271/sql-server-queries-beyond-a-certain-character-length-are-timing-out – Nick ODell Oct 31 '20 at 15:36
  • Discussed on GitHub [here](https://github.com/sqlalchemy/sqlalchemy/issues/5679). – Gord Thompson Oct 31 '20 at 17:43
  • @GordThompson the SQLAlchemy dev mentions using either `ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON` or `ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON` to not lock as aggressively by turning on snapshot isolation. I'm concerned about making these changes on the database level. Do you know of a setting in SQL server to make queries on `[INFORMATION_SCHEMA].[TABLES]` not block when an open transaction exists that has created a new table in that database? – Joe Oct 31 '20 at 17:52
  • You could try `create_engine(connection_uri, isolation_level="READ_UNCOMMITTED")` and see if that prevents the hang. – Gord Thompson Oct 31 '20 at 18:47
  • Unfortunately that did not prevent the query from hanging. – Joe Oct 31 '20 at 20:32
  • Hmm, it works for me. (Hangs without `isolation_level` and does not hang when I add it.) Perhaps try `engine = create_engine(connection_uri)` and then omit the `with` block and call `to_sql` with `engine` as the second parameter. That works for me, too. – Gord Thompson Oct 31 '20 at 21:01
  • That is strange, it doesn't work for me. Also just tried engine = ... as you suggested and it also did not work. The code hangs on `table_names = [r[0] for r in connection.execute(s)]` in `get_table_names` of `sqlalchemy.dialects.mssql.base.py`. Just to be clear, when you test this, you have an open transaction on the same database where you have created a new table in that database in the open transaction? That's what I have and as soon as I rollback or commit that transaction in SQL the Python code continues running again. – Joe Oct 31 '20 at 22:45
  • @Joe - *"Just to be clear, when you test this, you have an open transaction on the same database where you have created a new table in that database in the open transaction?"* - Yes, exactly. – Gord Thompson Nov 01 '20 at 00:38
  • Darn. Still doesn't work for me. Perhaps there are some database related settings that are causing the difference. I wonder if there is some way I can confirm the `isolation_level="READ_UNCOMMITTED"` option is being used in general. – Joe Nov 01 '20 at 17:11
  • `SELECT transaction_isolation_level FROM sys.dm_exec_sessions WHERE session_id = @@SPID;` returns the numeric value. `1` means READ_UNCOMMITTED. – Gord Thompson Nov 02 '20 at 14:05
  • Anyone ever figure out a fix for this? We've been stuck for a couple of days because of this. – PlunkettBoy Apr 29 '21 at 19:27
  • I have not...just ended up monkey patching Pandas for the time being. – Joe May 21 '21 at 17:39
  • Encountered this too while a separate process is creating a large columnstore index. It locks information_schema and prevents to_sql from running. – Gabe Mar 20 '23 at 20:55

3 Answers


I'm not very familiar with SQLAlchemy, but I can tell you about the Pandas side of this issue.

Pandas automatically creates a new table if the table doesn't exist. It determines whether the table exists by calling has_table() in SQLAlchemy, and has_table() works by querying the information schema. (At least, it works that way in MySQL and MSSQL.)
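As a rough illustration of that check (not Pandas's actual code; an in-memory SQLite database stands in for MSSQL here, since the inspector API is the same across dialects and only the query each dialect emits differs):

```python
import sqlalchemy as sa

# Sketch of the kind of existence check Pandas relies on.
engine = sa.create_engine("sqlite://")

print(sa.inspect(engine).has_table("my_table"))  # False: no such table yet

with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE my_table (id INTEGER)"))

print(sa.inspect(engine).has_table("my_table"))  # True: now it exists
```

On MSSQL, that second `has_table()` call is what turns into the SELECT against [INFORMATION_SCHEMA].[TABLES] shown in the question's log.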

Implementation details

Here's what I found tracing the logic for this in Pandas and SQLAlchemy. We start in pandas/io/sql.py, inside to_sql().

        table = SQLTable(
            name,
            self,
            frame=frame,
            index=index,
            if_exists=if_exists,
            index_label=index_label,
            schema=schema,
            dtype=dtype,
        )
        table.create()

SQLTable.create() is defined here:

class SQLTable(PandasObject):
    [...]
    def create(self):
        if self.exists():
            if self.if_exists == "fail":
                raise ValueError(f"Table '{self.name}' already exists.")
            elif self.if_exists == "replace":
                self.pd_sql.drop_table(self.name, self.schema)
                self._execute_create()
            elif self.if_exists == "append":
                pass
            else:
                raise ValueError(f"'{self.if_exists}' is not valid for if_exists")
        else:
            self._execute_create()

Notice that it calls exists() unconditionally. Inside SQLTable.exists(), you'll find this:

    def exists(self):
        return self.pd_sql.has_table(self.name, self.schema)

This eventually calls has_table() in SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/internals.html#sqlalchemy.engine.default.DefaultDialect.has_table

For MSSQL, this is implemented in sqlalchemy/dialects/mssql/base.py in SQLAlchemy:

    @_db_plus_owner
    def has_table(self, connection, tablename, dbname, owner, schema):
        if tablename.startswith("#"):  # temporary table
            [...]
        else:
            tables = ischema.tables

            s = sql.select(tables.c.table_name).where(
                sql.and_(
                    tables.c.table_type == "BASE TABLE",
                    tables.c.table_name == tablename,
                )
            )

            if owner:
                s = s.where(tables.c.table_schema == owner)

            c = connection.execute(s)

            return c.first() is not None

(ischema is an abbreviation of information_schema, and this code runs a SELECT against that view.)

How to fix this

I don't see a good, simple way to fix this. Pandas assumes that has_table() is a cheap operation. MSSQL doesn't follow that assumption. No matter what if_exists is set to, Pandas will call has_table() during to_sql().

I can think of a hacky way of doing this, though. If you were to monkey-patch pandas.io.sql.SQLTable.create() so that it's a no-op, then you could trick Pandas into thinking that the table already exists. The downside of this is that Pandas won't automatically create tables.
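A minimal sketch of that monkey-patch (hypothetical, and clearly a hack: it assumes the target table already exists with a compatible schema, since Pandas will no longer create it):

```python
import pandas.io.sql

def _noop_create(self, *args, **kwargs):
    # Skip the exists()/has_table() probe entirely; assumes the target
    # table already exists with a compatible schema.
    pass

# Replace SQLTable.create so to_sql() no longer probes INFORMATION_SCHEMA
# before inserting.
pandas.io.sql.SQLTable.create = _noop_create
```

Apply the patch before calling `to_sql()`, and keep in mind it affects every `to_sql()` call in the process.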

Nick ODell
  • I think you are basically on the right track, but the issue occurs after the insert. When Pandas `to_sql` does a check on whether the table might have case sensitivity issues (look for `# check for potentially case sensitivity issues (GH7815)` in `to_sql`) it ends up calling `get_table_names` in SQLAlchemy, which is ultimately what causes the block. – Joe Oct 31 '20 at 17:16
  • one relevant (albeit closed) github issue here: https://github.com/pandas-dev/pandas/issues/36542 – Gabe Mar 27 '23 at 21:09
  • also here: https://github.com/pandas-dev/pandas/issues/52601 – Gabe Jul 17 '23 at 16:04

Execute `SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED` before calling `to_sql()`.
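A sketch of how that might look with SQLAlchemy (`engine` and `df` are assumed to exist as in the question):

```python
import sqlalchemy as sa

# The statement to run on the same connection that to_sql() will use:
stmt = sa.text("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
print(stmt)  # SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

# Sketch of usage (assumes an existing engine and DataFrame df):
# with engine.connect() as connection:
#     connection.execute(stmt)
#     df.to_sql("my_table", con=connection, if_exists="append", index=False)
```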

Andy

I created this issue to keep track.

This only occurs if the table name has any uppercase letters: `MyTable` will get stuck, while `mytable` will finish successfully.

Is this an issue that Pandas needs to correct, or SQLAlchemy?

Workaround

There is a workaround: set the transaction isolation level to READ UNCOMMITTED in the SQLAlchemy engine.

https://docs.sqlalchemy.org/en/20/dialects/mssql.html#transaction-isolation-level

engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect={}".format(params),
    fast_executemany=True,
    isolation_level="READ UNCOMMITTED",
)
Gabe