
Databricks documentation mentions that we have to provide 'partitionColumn, lowerBound, upperBound and numPartitions' values while reading data from a relational database across multiple workers. It is also mentioned that the partitionColumn should be an integer column. How can I then make parallel reads from a table that doesn't have an integer column?

Databricks Documentation

I tried using 'rownum' (the source is an Oracle DB) as the partition column, but when I try to store the results in a dataframe, I get only the data from the first partition.

Here is the code:

jdbcUrl = "jdbc:oracle:thin:@//{0}:{1}/{2}".format(hostname, port, db)



connectionProperties = {
  "user" : 'XXXXXX',
  "password" : 'XXXXXX',
  "driver" : "oracle.jdbc.driver.OracleDriver",
  "oracle.jdbc.timezoneAsRegion" : "false"
}

parallel_df = spark.read.jdbc(url=jdbcUrl,
                          table=table_name, 
                          column='rownum', 
                          lowerBound=1, 
                          upperBound=200000, 
                          numPartitions=20,
                          properties=connectionProperties)

When I get the count of parallel_df, I get only 200000/20 = 10000 rows. Can anyone provide any insight on how to do this parallel read?

1 Answer


When I get the count of parallel_df, I get only 200000/20 = 10000 rows. Can anyone provide any insight on how to do this parallel read?

When you read a dataframe in that way (i.e. with partitions), Spark will issue a separate query for each partition, in your case something like

select t.* from table t where rownum between 1 and 10000      -- on executor 1
select t.* from table t where rownum between 10001 and 20000  -- on executor 2

and so on ...

From Oracle's point of view these queries are unrelated, and rownum (as usual) always starts from 1 in each of them, so you get only the first 10000 rows, coming from the first query.
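
One quick way to see this from the Spark side (a diagnostic sketch, not part of the original answer) is to count the rows that land in each Spark partition with spark_partition_id; with the rownum-based read above, only the first partition should report any rows:

from pyspark.sql.functions import spark_partition_id

# Diagnostic: group rows by the Spark partition they ended up in.
# With the rownum-based read, only partition 0 should have a non-zero count.
parallel_df.withColumn("pid", spark_partition_id()) \
           .groupBy("pid") \
           .count() \
           .show()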

Databricks documentation mentions that we have to provide 'partitionColumn, lowerBound, upperBound and numPartitions' values while reading data from a relational database across multiple workers.

True, but that is not mandatory. If your data doesn't have a column suitable for partitioning/splitting, just don't use that feature:

df = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
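
If you do need a parallel read even though the table has no integer key, one common workaround is to synthesize a numeric bucket column on the Oracle side, for example by hashing ROWID, and hand that to Spark as the partition column. Unlike rownum, the hash is computed identically in every partition's query. A sketch, where the column name part_id, the bucket count, and the subquery alias are illustrative assumptions:

num_buckets = 20

# Assumed subquery: bucket each row by hashing its ROWID, which is stable
# across the independent per-partition queries (unlike rownum).
bucketed_table = """(select t.*, mod(ora_hash(t.rowid), {0}) as part_id
                     from {1} t) subq""".format(num_buckets, table_name)

parallel_df = spark.read.jdbc(url=jdbcUrl,
                              table=bucketed_table,
                              column='part_id',
                              lowerBound=0,
                              upperBound=num_buckets,
                              numPartitions=num_buckets,
                              properties=connectionProperties)

Alternatively, PySpark's spark.read.jdbc also accepts a predicates argument (a list of WHERE-clause strings, one per partition), which lets you split the read on non-numeric columns such as dates without providing any bounds at all.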
gtosto
  • Thanks @gtosto. Is there a way to get all the partitions into a single dataframe? Like an aggregate parameter or similar? – Bharath Mohan Jul 10 '18 at 22:19
  • you **always** get all partitions in a dataframe. A dataframe (and the underlying RDD) is just an abstraction for _distributed data_ – gtosto Jul 12 '18 at 13:10