
Is it possible to run a "SELECT" query against a SQL Server database from an Azure Synapse notebook using PySpark + SQL?

The only way I've been able to ingest data from a SQL Server database into Azure Synapse is by creating an integration pipeline.

I'm new to using Azure Synapse as well as Apache Spark, so any advice you can provide is much appreciated.

1 Answer

This is possible and I have tested it with an Azure SQL Database, though I'm not 100% sure it would work with a SQL Server. It requires the network security to be set up correctly, and there needs to be a line of sight between the Synapse workspace and the SQL Server. Is your SQL Server in Azure, for example? Are they on the same VNet or peered VNets?

A simple example in a Synapse notebook:

import pyodbc

sqlQuery = "select @@version v"

conn = None
cursor = None

try:
  # Connect with a read-only SQL login; adjust the server, database and credentials
  conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
                         'SERVER=someSynapseDB.sql.azuresynapse.net;'
                         'DATABASE=yourDatabaseName;UID=someReadOnlyUser;'
                         'PWD=youWish;', autocommit = True )

  cursor = conn.cursor()
  cursor.execute(sqlQuery)

  # Print each row of the result set
  row = cursor.fetchone()
  while row:
    print(row[0])
    row = cursor.fetchone()

except:
  raise

finally:
  # Tidy up; only close what was actually opened
  if cursor:
    cursor.close()
  if conn:
    conn.close()

My results: (screenshot of the notebook output)

Inspired by this post by Jovan Popovic:

https://techcommunity.microsoft.com/t5/azure-synapse-analytics/query-serverless-sql-pool-from-an-apache-spark-scala-notebook/ba-p/2250968
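Since the question mentions PySpark + SQL specifically, here is a rough sketch of the same idea using the Spark JDBC reader instead of pyodbc. I haven't tested this against a managed instance; the JDBC URL, credentials and driver class below are placeholders you would need to adapt, and spark is the session Synapse pre-creates in the notebook:

# Read the result of a SELECT statement straight into a Spark DataFrame over JDBC.
# The server name, database, user and password here are placeholders.
jdbcUrl = "jdbc:sqlserver://yourSqlServer.database.windows.net:1433;database=yourDatabaseName"

df = (spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("query", "select @@version as v")   # any SELECT query should work here
      .option("user", "someReadOnlyUser")
      .option("password", "youWish")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

df.show()

Note the DataFrame only exists in the Spark pool's memory at that point; nothing lands in your primary data lake unless you explicitly persist it, for example with df.write.parquet(...).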

Just out of interest, is there a particular reason you are doing this in notebooks? Synapse pipelines are a perfectly good way of doing it, and a typical pattern would be to stage the data in a data lake. Is there some special functionality you need notebooks for?

wBob
  • Hey wBob, thank you for your solution. The SQL Server would be on a managed instance. I do like ingesting data through a pipeline, though I wanted to avoid incurring a cost whenever I was copying any data over into Synapse. I guess that wouldn't necessarily matter in the long run if I'm doing any orchestration with a Synapse notebook. Thank you again :) – CrudeOperator May 13 '21 at 21:16
  • I am wondering the same thing @CrudeOperator. It seems like reading data like this in a notebook would be more efficient than having a pipeline move it to your primary data lake and then reading it into a dataframe from there. But I'm new to Synapse and Spark and not clear on the specifics - like isn't this dataframe going to be allocated in your primary data lake? Or with this design is the data never saved to your primary data lake, rather allocated in memory across your Spark cluster? – Dudeman3000 Mar 08 '22 at 19:17