The Databricks documentation says that we have to provide 'partitionColumn', 'lowerBound', 'upperBound' and 'numPartitions' values when reading data from a relational database across multiple workers. It also says that partitionColumn should be an integer column. How can I make parallel reads from a table that doesn't have an integer column?
I tried using 'rownum' (the source is an Oracle DB) as the partition column, but when I try to store the results in a DataFrame, I get only the data from the first partition.
Here is the code:
jdbcUrl = "jdbc:oracle:thin:@//{0}:{1}/{2}".format(hostname, port, db)

connectionProperties = {
    "user": "XXXXXX",
    "password": "XXXXXX",
    "driver": "oracle.jdbc.driver.OracleDriver",
    "oracle.jdbc.timezoneAsRegion": "false"
}

parallel_df = spark.read.jdbc(url=jdbcUrl,
                              table=table_name,
                              column="rownum",
                              lowerBound=1,
                              upperBound=200000,
                              numPartitions=20,
                              properties=connectionProperties)
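To understand what Spark does with these options, I wrote a small pure-Python sketch of (roughly) how Spark's JDBC source turns `column`/`lowerBound`/`upperBound`/`numPartitions` into one WHERE clause per partition (simplified from its column-partitioning logic; not Spark's actual code):

```python
# Rough simulation of how Spark's JDBC source builds per-partition
# WHERE clauses from (partitionColumn, lowerBound, upperBound, numPartitions).
# Simplified sketch, not Spark's actual implementation.
def partition_clauses(column, lower, upper, num_partitions):
    stride = (upper // num_partitions) - (lower // num_partitions)
    clauses = []
    current = lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower_clause and upper_clause:
            clauses.append(f"{lower_clause} AND {upper_clause}")
        elif lower_clause:
            clauses.append(lower_clause)
        else:
            # first partition also picks up NULL keys
            clauses.append(f"{upper_clause} OR {column} IS NULL")
    return clauses

for clause in partition_clauses("rownum", 1, 200000, 20):
    print(clause)
# first partition:  rownum < 10001 OR rownum IS NULL
# last partition:   rownum >= 190001
```

If this is accurate, it would explain my result: Oracle assigns ROWNUM per result set, starting at 1, so a filter like `rownum >= 10001` can never match any row. Every partition except the first would then return zero rows.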
When I get the count of parallel_df, I get only 200000 / 20 = 10000 rows. Can anyone provide any insight on how to do this parallel read?
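One workaround I'm considering, though I'm not sure it's idiomatic, is the `predicates` overload of `spark.read.jdbc`, which takes an explicit WHERE clause per partition instead of an integer column. Hashing any stable key with Oracle's ORA_HASH should spread rows across partitions; in the sketch below `cust_id` is a hypothetical column name and 20 buckets is an arbitrary choice:

```python
# Sketch: one WHERE clause per partition, bucketing rows by hashing an
# arbitrary (non-integer) column with Oracle's ORA_HASH.
# 'cust_id' is a hypothetical key column; 20 buckets is arbitrary.
num_buckets = 20
predicates = [
    f"MOD(ORA_HASH(cust_id), {num_buckets}) = {b}" for b in range(num_buckets)
]

# Each predicate becomes one Spark partition; together the buckets cover
# every row exactly once:
# parallel_df = spark.read.jdbc(url=jdbcUrl,
#                               table=table_name,
#                               predicates=predicates,
#                               properties=connectionProperties)
```

Would this be the right approach, or is there a better way?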