
I am reading in a table from our SQL Server into Databricks. The source SQL table contains ~50M rows. On read, I am selecting one client, like this:

table1 = "(select Client, Revenue from dbo.sqltable where Client in ('0000000001')) A"

df = spark.read.options(user=user, password=pswd, columnEncryptionSetting='Enabled') \
  .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + '; database=' + database, table1)
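For reference, a minimal sketch of the same pushdown read written with the `query` option (available in Spark 2.4+) instead of the parenthesized subquery-as-table form; the connection placeholders (sqlserver, port, database, user, pswd) are the same as above:

url = 'jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database

df = spark.read.format('jdbc') \
  .option('url', url) \
  .option('query', "select Client, Revenue from dbo.sqltable where Client = '0000000001'") \
  .option('user', user) \
  .option('password', pswd) \
  .option('columnEncryptionSetting', 'Enabled') \
  .load()

Either form ships the WHERE clause to SQL Server, so only the ~400 matching rows cross the wire. The resulting DataFrame is still lazy, though, so the query re-runs against the server on every action.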

When I call df.show() to display the result, I can see the table, BUT it takes over a minute to process 400 rows (!). I thought maybe this is because the source table is so large and .show() is working off the original 50M-row table even though I filter the df upfront. If that is the case, how do I get the processing down to ~0.1 seconds for a .show() on the 400-row df? (That is roughly what we get on our dev cluster for 20K rows.) I have tried creating a temp view of the sampled df to see if it makes a difference, but I don't know if I am on the right track:

df.createOrReplaceTempView('dftmp')
spark.table("dftmp")
spark.table("dftmp").cache
spark.table("dftmp").count

All I get out of this is:

<bound method DataFrame.count of DataFrame[Client: string, Revenue: decimal(17,2)]>

I got the above from here: how-does-createorreplacetempview-work-in-spark
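The `<bound method ...>` output is the giveaway: in Python, `cache` and `count` written without parentheses return the method objects instead of calling them, so nothing is actually cached or counted (the first comment below points this out). A corrected sketch of the temp-view attempt:

df.createOrReplaceTempView('dftmp')
spark.table('dftmp').cache()   # mark the plan for caching (lazy, returns the DataFrame)
spark.table('dftmp').count()   # first action: runs the JDBC query and fills the cache
spark.table('dftmp').show()    # served from the cached data

Spark keys the cache on the logical plan, so each `spark.table('dftmp')` call resolves to the same cached plan.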

  • Additionally, see [Partitioning in spark while reading from RDBMS via JDBC](https://stackoverflow.com/q/43150694/10465355). Also, `count` and `cache` are methods in Python (they should be called), and the trick itself [doesn't make much sense in `DataFrame` API](https://stackoverflow.com/q/50379782/10465355). – 10465355 Feb 06 '19 at 10:21
  • Thanks @user10465355. df.cache() and df.count() (to kick it into action, which took a while: about 1 minute), and then df.show() returned the results in 0.13 seconds. – GivenX Feb 06 '19 at 13:16
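Putting the comment thread together, a minimal sketch of the pattern that resolved this (the first action pays the JDBC read once; later actions hit the cache):

df.cache()    # lazy: marks the DataFrame for caching, nothing runs yet
df.count()    # first action: executes the query on SQL Server and materializes the cache (~1 minute here)
df.show()     # reads from the cache (~0.13 seconds here)

If the one-off materialization itself is too slow, the partitioned-read options from the linked question (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) let Spark issue the JDBC read in parallel.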

0 Answers