I am reading in a table from our SQL Server into Databricks. The source SQL table contains ~50M rows. On read, I am selecting one client, like this:
table1 = "(select Client, Revenue from dbo.sqltable where Client in ('0000000001')) A"
df = spark.read.options(user=user, password=pswd, columnEncryptionSetting='Enabled') \
.jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + '; database=' + database, table1)
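For reference, here is the same read assembled in the .format('jdbc') style, which makes it easier to bolt on tuning options. This is only a sketch: the fetchsize value is a guess on my part, and I am assuming the extra options (like columnEncryptionSetting) are forwarded to the SQL Server driver as connection properties:

    # Equivalent read in option/load style (sketch; fetchsize value is a guess)
    url = 'jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database
    df = (spark.read.format('jdbc')
          .option('url', url)
          .option('dbtable', table1)                      # the pushdown subquery from above
          .option('user', user)
          .option('password', pswd)
          .option('columnEncryptionSetting', 'Enabled')   # passed through to the MSSQL JDBC driver
          .option('fetchsize', 1000)                      # rows fetched per network round trip
          .load())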
When I run df.show() to display the result, I can see the table, BUT it takes over a minute to return 400 rows (!). I thought maybe this is because the source table is so large and the .show() is working off the original 50M-row table even though I filter in the JDBC read upfront. If this is the case, how do I get my processing down to the ~0.1 seconds I would expect for a .show() on a 400-row df (which is what we get on our dev cluster for 20K rows)?
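One way I can think of to check this (my assumption) is to look at the physical plan: if the filter really is pushed down, the JDBC relation in the plan should show the subquery rather than the full table:

    df.explain()
    # Expected output (roughly):
    #   Scan JDBCRelation((select Client, Revenue from dbo.sqltable
    #     where Client in ('0000000001')) A) [numPartitions=1] ...
    # i.e. SQL Server runs the subquery itself; the [numPartitions=1] also shows
    # that the whole read happens over a single JDBC connection.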
I have tried creating a temp view of the filtered df to see if this makes a difference, but I don't know if I am on the right track:
    df.createOrReplaceTempView('dftmp')
    spark.table("dftmp")
    spark.table("dftmp").cache()
    spark.table("dftmp").count()
My first attempt left the parentheses off .cache and .count, so all I got out of this was the bound method itself rather than a result:

    <bound method DataFrame.count of DataFrame[Client: string, Revenue: decimal(17,2)]>
I got the above from here: how-does-createorreplacetempview-work-in-spark
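For completeness, here is the pattern I think that question describes, with the parentheses in place (a sketch; my understanding is that cache() is lazy until an action runs):

    df.createOrReplaceTempView('dftmp')
    dftmp = spark.table('dftmp').cache()  # marks the data for caching; nothing is read yet
    dftmp.count()   # first action runs the JDBC query once and materializes the cache
    dftmp.show()    # later actions should read the 400 cached rows, not SQL Server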