Spark allows you to read from a SQL database source in parallel, and you can partition the read based on a sliding window, for example (from the book, chapter 7):
val colName = "count"
val lowerBound = 0L
val upperBound = 348113L // this is the max count in our table
val numPartitions = 10
spark.read.jdbc(url,
tablename,
colName,
lowerBound,
upperBound,
numPartitions,
props).count()
Here, the upper bound is known beforehand.
Let's say a table gets 'x' rows in a day (anywhere between 1 and 2 million), and at the end of the day we submit a Spark job, do some transformations, and write the result out as Parquet/CSV/JSON. If we don't know beforehand how many rows will have been written to the SQL source database (since it varies between 1 and 2 million), what is the best approach or practice for partitioning the read in such a scenario?
One way is to estimate the upper bound, but I am not sure that is the right approach.
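For instance, what I have in mind is something like the rough sketch below: query the actual min/max of the partition column first, then feed those values into the partitioned read instead of hard-coding them. This is only an assumption on my part, reusing url, tablename, colName and props from the snippet above; the subquery-alias syntax and the numeric type handling may need adjusting for the specific SQL dialect, and numPartitions is still a guess.

// Sketch: derive the bounds at runtime rather than hard-coding them
val boundsQuery = s"(SELECT MIN($colName) AS lo, MAX($colName) AS hi FROM $tablename) AS bounds"
val bounds = spark.read.jdbc(url, boundsQuery, props).head()

// Read via Number so it works whether the column comes back as INT or BIGINT
val lowerBound = bounds.getAs[Number](0).longValue()
val upperBound = bounds.getAs[Number](1).longValue()
val numPartitions = 10 // still arbitrary; could perhaps be sized from (upperBound - lowerBound)

spark.read.jdbc(url, tablename, colName, lowerBound, upperBound, numPartitions, props)

But this costs an extra round trip to the database just to get the bounds, so I am not sure whether this is considered good practice either.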