
I have a problem where I see deadlocks on my Azure SQL Database. These started occurring after we moved the code from a virtual machine to Airflow, which means the most obvious changes are:

  • From Windows Server 2022 Datacenter Azure Edition to Kubernetes 1.25.6
  • Version changes of SQLAlchemy and/or pandas.

The call is always df.to_sql with if_exists='replace', and it always crashes in check_case_sensitive, where it deadlocks on sys.tables. If it deadlocked on an actual data table, I would understand it as something like a transaction commit issue; if I had queried sys.tables myself, I might understand it too. But the error is raised deep within the module, and the case check only exists to raise the warning "The provided table name '{name}' is not found exactly as such in the database after writing the table, possibly due to case sensitivity issues. Consider using lower case table names." without fixing anything. The query it runs is:

select [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
from [INFORMATION_SCHEMA].[TABLES]
where [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST('dbo' as nvarchar(max))
    and [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST('BASE TABLE' as nvarchar(max))
order by [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
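
For reference, this metadata query can be reproduced in isolation with SQLAlchemy's inspector; a minimal sketch, where the connection string is only a placeholder:

from sqlalchemy import create_engine, inspect

# Placeholder DSN; the real one points at the Azure SQL Database
# via ODBC Driver 18, as in the traceback below.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net/mydb"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)

# The same call pandas makes in check_case_sensitive; it emits the
# INFORMATION_SCHEMA.TABLES query shown above.
insp = inspect(engine)
print(insp.get_table_names(schema="dbo"))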

We have no full transaction log on the SQL DB. What options do I have to find the problem? I'm out of ideas.

Traceback:

Traceback (most recent call last):
  File "/opt/airflow/dags/repo/dags/xyz.py", line 308, in xyzabc
    df.to_sql(tablename, con=engine, if_exists=if_exists, dtype=datatypedict, index=False, chunksize=10000)
  File "/home/airflow/.local/lib/python3.10/site-packages/pandas/core/generic.py", line 2878, in to_sql
    return sql.to_sql(
  File "/home/airflow/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 769, in to_sql
    return pandas_sql.to_sql(
  File "/home/airflow/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 1932, in to_sql
    self.check_case_sensitive(name=name, schema=schema)
  File "/home/airflow/.local/lib/python3.10/site-packages/pandas/io/sql.py", line 1832, in check_case_sensitive
    table_names = insp.get_table_names(schema=schema or self.meta.schema)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 266, in get_table_names
    return self.dialect.get_table_names(
  File "<string>", line 2, in get_table_names
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 55, in cache
    ret = fn(self, con, *args, **kw)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/dialects/mssql/base.py", line 2613, in wrap
    return _switch_db(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/dialects/mssql/base.py", line 2655, in _switch_db
    return fn(*arg, **kw)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/dialects/mssql/base.py", line 3065, in get_table_names
    table_names = [r[0] for r in connection.execute(s)]
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1385, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1948, in _execute_context
    self._handle_dbapi_exception(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2129, in _handle_dbapi_exception
    util.raise_(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1905, in _execute_context
    self.dialect.do_execute(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('40001', '[40001] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Transaction (Process ID 418) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (1205) (SQLExecDirectW)')
[SQL: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]]
[parameters: ('dbo', 'BASE TABLE')]
(Background on this error at: https://sqlalche.me/e/14/dbapi)

JasperKPI
  • Why are you casting your input parameter to a `MAX`? That's going to cause performance issues (notice how much the [query plans differ](https://dbfiddle.uk/3_y1eRdY)). Also [3+ part naming on Columns will be Deprecated](https://wp.larnu.uk/3-part-naming-on-columns-will-be-deprecated/) and should be avoided. You are far better off aliasing your objects in the `FROM` and using those to qualify your columns. This avoids the use of a feature that will be deprecated and possibly removed in a future version, and makes your code much more succinct and readable. – Thom A Jul 24 '23 at 14:06
  • In Azure you can get the deadlock graphs from querying the `master` database as here https://techcommunity.microsoft.com/t5/azure-database-support-blog/lesson-learned-19-how-to-obtain-the-deadlocks-of-your-azure-sql/ba-p/368847. You should do that to see the exact cause – Martin Smith Jul 24 '23 at 14:13
  • does it do some sort of multi-threaded inserts together with various system table fetches? – siggemannen Jul 24 '23 at 14:18
  • For deadlock diagnostics, we need to see the XDL/XML deadlock graph. The execution plan would also be useful. – Charlieface Jul 24 '23 at 14:30
  • @ThomA - Similar to [this case](https://stackoverflow.com/a/76707703/2144390), that query is not created by the OP. It is generated by SQLAlchemy. – Gord Thompson Jul 24 '23 at 14:36
  • *Yuck*, @GordThompson . That's definitely a bug. Or, if not, it's like the problem with `AddWithValue` turned up to 11. – Thom A Jul 24 '23 at 14:42
  • @ThomA - If you feel so inclined, you could ask about it [here](https://github.com/sqlalchemy/sqlalchemy/discussions). – Gord Thompson Jul 24 '23 at 14:49
  • I'm unfortunately not familiar with SQLAlchemy, @GordThompson, so I'm not sure I'd be in a *good* position to write up an issue, or whether there's actually a better alternative within its syntax (like `Parameters.Add` in the prior example of `Parameters.AddWithValue`). – Thom A Jul 24 '23 at 14:52
  • @ThomA I haven't written it; it is fully managed within SQLAlchemy. I would never write it this way, it's ridiculous. (Edit: as someone pointed out already) – JasperKPI Jul 24 '23 at 14:55
  • @MartinSmith That just shows "Deadlock occurred." Which is true, but yeah. There are no connectivity issues before or after. – JasperKPI Jul 24 '23 at 14:55
  • @siggemannen The code itself doesn't do multithreading, and there are no other similar codes running. Although strictly possible there is a Power BI refresh running at the same time, even that is very unlikely. – JasperKPI Jul 24 '23 at 14:55
  • @JasperKPI - you can get the XML deadlock graph - which is considerably more informative than that - the query against `sys.fn_xe_telemetry_blob_target_read_file` in the above link – Martin Smith Jul 24 '23 at 14:56
  • @MartinSmith oh wow, I'm going to look into this, this is really great. I found the deadlocks, and will report back. – JasperKPI Jul 24 '23 at 15:18
  • It deadlocks between different processes running against the same database, where some of them aren't even on Airflow at all. However, we have been doing this for years already. – JasperKPI Jul 24 '23 at 15:28
  • There's nothing we can do with that information as we can't see the graph – Martin Smith Jul 24 '23 at 16:35

1 Answer


As Martin Smith pointed out in the comments:

In Azure you can get the deadlock graphs from querying the master database as described in [Lesson Learned #19: How to obtain the deadlocks of your Azure SQL Database or Managed Instance?](https://techcommunity.microsoft.com/t5/azure-database-support-blog/lesson-learned-19-how-to-obtain-the-deadlocks-of-your-azure-sql/ba-p/368847). You should do that to see the exact cause
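
Concretely, the article reads the deadlock ('dl') extended-event target through a system function on the master database of the logical server. A sketch of how the graphs can be pulled and saved for inspection; the connection string is a placeholder, and the query follows the one in the linked article:

import pandas as pd
from sqlalchemy import create_engine, text

# Connect to the *master* database of the logical server (placeholder DSN).
master_engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net/master"
    "?driver=ODBC+Driver+18+for+SQL+Server"
)

# Per the linked article, sys.fn_xe_telemetry_blob_target_read_file('dl', ...)
# returns one row per deadlock event, with the XML in event_data.
deadlock_sql = text("""
    WITH events AS (
        SELECT CAST(event_data AS xml) AS event_xml
        FROM sys.fn_xe_telemetry_blob_target_read_file('dl', NULL, NULL, NULL)
    )
    SELECT CAST(event_xml.query(
        '/event/data[@name=''xml_report'']/value/deadlock'
    ) AS nvarchar(max)) AS deadlock_xml
    FROM events
""")

with master_engine.connect() as conn:
    graphs = pd.read_sql(deadlock_sql, conn)

# Save each graph as an .xdl file so it opens as a graphical
# deadlock graph in SSMS.
for i, xml in enumerate(graphs["deadlock_xml"]):
    with open(f"deadlock_{i}.xdl", "w") as f:
        f.write(xml)

Opening the .xdl files in SSMS is what showed us the other processes involved.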

Doing that, we found out that it deadlocked against completely different processes working on the same database, some running on Airflow and some not, which we probably just hadn't noticed before. It did not deadlock on itself, on the previous query, or because of multithreading.

Several possible explanations are:

  • There was an external cause that didn't depend on the migration (e.g. a busy database pool)
  • We were paying closer attention to the logs during the migration, and so noticed deadlocks that had previously gone unseen
  • Airflow starts processes at more precise times, whereas on Windows the processes get staggered if you start 10 at once, leading to more overlap on Airflow.

We have not found the exact cause, but I am fairly sure it wasn't a specific pandas issue. The answer for anyone else hitting this: check the deadlock graphs, because it is probably not what you think it is. It is most probably not SQLAlchemy or pandas deadlocking itself. On an Azure Elastic Pool, check your pool usage; on SQL Server, check your resources.

You do not need Azure SQL Auditing or other optional transaction-style monitoring; the deadlock graphs come from a quite basic SQL Server system function.
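
Separately, the 1205 message itself says "Rerun the transaction", so while the real contention is being tracked down the write can simply be retried. A minimal sketch; the helper and its parameters are my own, not part of pandas or SQLAlchemy:

import time

import sqlalchemy.exc

def to_sql_with_retry(df, name, engine, retries=3, delay=5, **kwargs):
    # Hypothetical helper: rerun df.to_sql when this session is chosen
    # as the deadlock victim (SQLSTATE 40001 / SQL Server error 1205).
    for attempt in range(retries):
        try:
            return df.to_sql(name, con=engine, **kwargs)
        except sqlalchemy.exc.DBAPIError as exc:
            if "40001" in str(exc) and attempt < retries - 1:
                time.sleep(delay)
                continue
            raise

With if_exists='replace' a retry just drops and re-creates the table again, which is what the job does anyway.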

JasperKPI