In the following sample code, in one cell of our Azure Databricks notebook, we load about 20 million records from an Azure SQL database into a Python pandas dataframe
and then transform some of the dataframe columns by applying functions to them (as shown in the code snippet below). After running for about half an hour, Databricks throws the following error:
Error:
ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
Remarks: The table has about 150 columns. The Databricks cluster configuration is as follows:
Cluster: 128 GB, 16 cores, DBR 8.3 (Spark 3.1.1, Scala 2.12)
Question: What could be causing this error, and how can we fix it?
import sqlalchemy as sq
import pandas as pd

def fn_myFunction(lastname):
    # normalize the value: lowercase it and strip surrounding whitespace
    testvar = lastname.lower()
    testvar = testvar.strip()
    return testvar
pw = dbutils.secrets.get(scope='SomeScope',key='sql')
engine = sq.create_engine('mssql+pymssql://SERVICE.Databricks.NONPUBLICETL:'+pw+'@MyAzureSQL.database.windows.net:1433/TEST', isolation_level="AUTOCOMMIT")
app_df = pd.read_sql('select * from MyTable', con=engine)
#create new column
app_df['NewColumn'] = app_df['TestColumn'].apply(lambda x: fn_myFunction(x))
.............
.............
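For context, one alternative I have been considering (not sure whether it actually addresses the cause) is to keep the data in Spark rather than pulling everything into pandas on the driver, and to express the lowercase/strip logic as Spark column operations. Below is a rough sketch only; the JDBC URL, the partition column name, and the bounds are placeholders, not our real schema:

from pyspark.sql import functions as F

pw = dbutils.secrets.get(scope='SomeScope', key='sql')

# placeholder JDBC URL; our real server/database names differ
jdbc_url = 'jdbc:sqlserver://MyAzureSQL.database.windows.net:1433;database=TEST'

spark_df = (spark.read
    .format('jdbc')
    .option('url', jdbc_url)
    .option('dbtable', 'MyTable')
    .option('user', 'SERVICE.Databricks.NONPUBLICETL')
    .option('password', pw)
    .option('numPartitions', 16)                 # read in parallel instead of one driver-side query
    .option('partitionColumn', 'SomeIntColumn')  # placeholder: an indexed numeric column
    .option('lowerBound', 1)
    .option('upperBound', 20000000)
    .load())

# same logic as fn_myFunction, but done distributed in Spark
spark_df = spark_df.withColumn('NewColumn', F.trim(F.lower(F.col('TestColumn'))))

Alternatively, pandas' read_sql accepts a chunksize argument so the result is streamed in pieces rather than materialized all at once, but I am not sure that helps if we ultimately need the full dataframe in driver memory.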