I am connected via jdbc
to a DB having 500'000'000 of rows and 14 columns.
Here is the code used:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
properties = {'jdbcurl': 'jdbc:db:XXXXXXXXX','user': 'XXXXXXXXX', 'password': 'XXXXXXXXX'}
data = spark.read.jdbc(properties['jdbcurl'], table='XXXXXXXXX', properties=properties)
data.show()
The code above took 9 seconds to display the first 20 rows of the DB.
Later I created a SQL temporary view via
data[['XXX','YYY']].createOrReplaceTempView("ZZZ")
and I ran the following query:
sqlContext.sql('SELECT AVG(XXX) FROM ZZZ').show()
The code above took 1355.79 seconds (circa 23 minutes). Is this ok? It seems to be a large amount of time.
In the end I tried to count the number of rows of the DB
sqlContext.sql('SELECT COUNT(*) FROM ZZZ').show()
It took 2848.95 seconds (circa 48 minutes).
Am I doing something wrong or are these amounts standard?