
I have the code snippet below for reading data from a PostgreSQL table, from which I am pulling all available data, i.e. select * from table_name:

    jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", self.var_dict['jdbc_url']) \
    .option("dbtable", "({0}) as subq".format(query)) \
    .option("user", self.var_dict['db_user']) \
    .option("password", self.var_dict['db_password']) \
    .option("driver", self.var_dict['db_driver']) \
    .option("numPartitions", 10) \
    .option("fetchsize", 10000) \
    .load()

Here var_dict is a dictionary containing my variables like the Spark context, database credentials, etc.

Even when I am pulling millions of rows, the code below always returns 1:

partitions_num = jdbcDF.rdd.getNumPartitions()

Can someone advise if I am doing something wrong here? Ideally I would expect to use the maximum available resources rather than pulling the data onto my master node only.

partitionColumn, lowerBound, and upperBound cannot be used because my partition column is a timestamp, not numeric.

Abhi
  • Possible duplicate of [Partitioning in spark while reading from RDBMS via JDBC](https://stackoverflow.com/questions/43150694/partitioning-in-spark-while-reading-from-rdbms-via-jdbc) – 10465355 Feb 28 '19 at 14:27
  • Please see my answer here: https://stackoverflow.com/a/40938905/2639647 – Travis Hegner Feb 28 '19 at 15:39

1 Answer


As of Spark 2.4.0, date and timestamp columns are also supported for partitioning; see https://issues.apache.org/jira/browse/SPARK-22814.
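
For example, here is a minimal sketch reusing the variables from your snippet; "created_at" and the bound values are placeholders for your actual timestamp column and date range:

    # Spark 2.4+: partitionColumn may be a date/timestamp column.
    # "created_at" and the bounds below are placeholders; lowerBound/upperBound
    # only decide how the partition ranges are split, they do not filter rows.
    jdbcDF = (spark.read
        .format("jdbc")
        .option("url", self.var_dict['jdbc_url'])
        .option("dbtable", "({0}) as subq".format(query))
        .option("user", self.var_dict['db_user'])
        .option("password", self.var_dict['db_password'])
        .option("driver", self.var_dict['db_driver'])
        .option("partitionColumn", "created_at")
        .option("lowerBound", "2018-01-01 00:00:00")
        .option("upperBound", "2019-01-01 00:00:00")
        .option("numPartitions", 10)
        .option("fetchsize", 10000)
        .load())

With this in place, jdbcDF.rdd.getNumPartitions() should report up to the configured 10 partitions instead of 1, and the reads will be issued in parallel from the executors.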