3

I am using Spark 2.3 and loading data from MySQL using JDBC as below:

  val dataSet:Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map("url" -> jdbcUrl
                ,"user" -> username
                ,"password" -> password
                ,"dbtable" -> dataSourceTableName
                ,"driver" -> driver
                ))
    .load() 

I would like to partition the dataset based on a particular column in the table. How can I achieve this?

Remis Haroon - رامز

2 Answers

3

You need to specify partitionColumn, upperBound, lowerBound and numPartitions options.

These are described in the property table of the JDBC documentation for Spark SQL.

These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
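
For example, here is a minimal sketch of the question's read with those four options added (the column name "id" and the bound values are placeholders, not from the original post):

  val dataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map("url" -> jdbcUrl
                ,"user" -> username
                ,"password" -> password
                ,"dbtable" -> dataSourceTableName
                ,"driver" -> driver
                // all four partitioning options must be set together;
                // "id" stands in for a numeric, date, or timestamp column in your table
                ,"partitionColumn" -> "id"
                ,"lowerBound" -> "1"
                ,"upperBound" -> "1000000"
                ,"numPartitions" -> "10"
                ))
    .load()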

Further explanation of the upperBound and lowerBound parameters can be found in @PIYUSH PASARI's answer.

He gives the following example of the queries generated for these parameter values:

upperBound = 500, lowerBound = 0 and numPartitions = 5.

SELECT * FROM table WHERE partitionColumn < 100 or partitionColumn is null
SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
SELECT * FROM table WHERE partitionColumn >= 200 AND partitionColumn < 300
SELECT * FROM table WHERE partitionColumn >= 300 AND partitionColumn < 400
SELECT * FROM table WHERE partitionColumn >= 400

This can be seen from the code in JDBCRelation.scala.
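
Roughly, the predicates are generated like this simplified sketch (an approximation of JDBCRelation.columnPartition for integer bounds, not the actual Spark source):

  // Builds one WHERE clause per partition; the first partition also picks up NULLs
  // and the last partition is open-ended, so no rows are dropped.
  def partitionWhereClauses(column: String,
                            lowerBound: Long,
                            upperBound: Long,
                            numPartitions: Int): Seq[String] = {
    val stride = (upperBound - lowerBound) / numPartitions
    (0 until numPartitions).map { i =>
      val lower = lowerBound + i * stride
      val upper = lower + stride
      if (i == 0) s"$column < $upper or $column is null"
      else if (i == numPartitions - 1) s"$column >= $lower"
      else s"$column >= $lower AND $column < $upper"
    }
  }

  // partitionWhereClauses("partitionColumn", 0, 500, 5) yields the five clauses above.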

As you can see, all rows are fetched, but if your upper and lower bounds do not cover the whole data range, the first and last partitions may be bigger than the others. If you can't be sure of the upper and lower bounds, would like even partitions, and are not concerned with getting every row, you could always set the upper and lower bounds as conditions in your dbtable parameter, as in the sketch below.
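
For instance (table, column, and bound values here are hypothetical), dbtable accepts an aliased subquery, so the same bounds can also filter the rows:

  val boundedDataSet: Dataset[Row] = _spark
    .read
    .format("jdbc")
    .options(Map("url" -> jdbcUrl
                ,"user" -> username
                ,"password" -> password
                ,"driver" -> driver
                // the subquery filters the rows, while lowerBound/upperBound
                // only control the partition stride
                ,"dbtable" -> "(SELECT * FROM my_table WHERE id BETWEEN 0 AND 500) AS t"
                ,"partitionColumn" -> "id"
                ,"lowerBound" -> "0"
                ,"upperBound" -> "500"
                ,"numPartitions" -> "5"
                ))
    .load()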

Frank Wilson
  • Thanks, is there a way to avoid giving "upperBound and lowerBound" and just give partitionColumn only – Remis Haroon - رامز Dec 26 '18 at 06:39
  • No, each parameter must be specified. They are only used to specify the stride length for each partition. All rows will be returned, but if your bounds are not correct the first and last partitions may be larger than the rest. If you want even partitions and only want results in a given range, you should also set upper and lower bounds in the query. – Frank Wilson Dec 26 '18 at 06:42
1
spark.read("jdbc")
  .option("url", url)
  .option("dbtable", "pets")
  .option("user", user)
  .option("password", password)
  .option("numPartitions", 10)
  .option("partitionColumn", "owner_id")
  .option("lowerBound", 1)
  .option("upperBound", 10000)

Read more at the following link.

vaquar khan