You need to specify the partitionColumn, upperBound, lowerBound and numPartitions options. These are described in the property table of the Spark SQL JDBC documentation. These options must all be specified together; they describe how to partition the table when reading in parallel from multiple workers.
partitionColumn must be a numeric, date, or timestamp column of the table in question. Note that lowerBound and upperBound are only used to decide the partition stride, not to filter the rows in the table, so all rows in the table will be partitioned and returned. These options apply only to reading.
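For example, a parallel JDBC read with these options might look like the following minimal sketch (the connection URL, credentials, table name and column name are placeholders, not values from the question):

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder URL
  .option("dbtable", "schema.table")                    // placeholder table
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id")                      // numeric, date or timestamp column
  .option("lowerBound", "0")
  .option("upperBound", "500")
  .option("numPartitions", "5")
  .load()
```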
Further explanation of the upperBound and lowerBound parameters can be found in @PIYUSH PASARI's answer. He gives the following example of the queries generated with the parameter values upperBound = 500, lowerBound = 0 and numPartitions = 5:
SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL
SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
SELECT * FROM table WHERE partitionColumn >= 200 AND partitionColumn < 300
SELECT * FROM table WHERE partitionColumn >= 300 AND partitionColumn < 400
SELECT * FROM table WHERE partitionColumn >= 400
This can be seen from the code in JDBCRelation.scala.
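The partitioning logic amounts to computing a stride from the bounds and emitting one WHERE clause per partition. A simplified sketch of that logic (an illustration, not the actual code in JDBCRelation.scala) might look like this:

```scala
// Simplified sketch of how partition predicates are derived from the bounds.
// Not the exact implementation in JDBCRelation.scala.
def partitionPredicates(column: String,
                        lowerBound: Long,
                        upperBound: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val lower = lowerBound + i * stride
    val upper = lower + stride
    if (i == 0)                      s"$column < $upper OR $column IS NULL"
    else if (i == numPartitions - 1) s"$column >= $lower"
    else                             s"$column >= $lower AND $column < $upper"
  }
}

// With lowerBound = 0, upperBound = 500 and numPartitions = 5 this yields
// the five predicates shown above, one per partition.
```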
As you can see, all rows are fetched, but if your upper and lower bounds do not cover the whole data range, the first and last partitions may be bigger than the others. If you can't be sure of the upper and lower bounds, would like even partitions and are not concerned with getting every row, you could always set the upper and lower bounds as conditions in your dbtable parameter.
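For instance, the bounds can be pushed into the dbtable option as a subquery (a sketch; the table, column and bounds are illustrative):

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder URL
  // Filtering in the subquery keeps the partition range aligned with the data,
  // at the cost of silently dropping rows outside [0, 500).
  .option("dbtable",
    "(SELECT * FROM schema.table WHERE partitionColumn >= 0 AND partitionColumn < 500) AS t")
  .option("partitionColumn", "partitionColumn")
  .option("lowerBound", "0")
  .option("upperBound", "500")
  .option("numPartitions", "5")
  .load()
```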