For my use case, I am trying to read one big Oracle table using Spark JDBC. Since I do not have an integer-typed column in the table, I am using the rownum pseudo-column as the partitionColumn.
Here is what my Spark query looks like (for testing I am using a table with only 22000 rows):
val df = spark.read.jdbc(url = url, table = "(select * from table1)",
  columnName = "rownum", lowerBound = 0L, upperBound = 22000L,
  numPartitions = 3, connectionProperties = oracleProperties)
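For context, here is my understanding of how Spark turns lowerBound/upperBound/numPartitions into per-partition WHERE clauses (a minimal sketch modeled on JDBCRelation.columnPartition; the exact SQL strings are my assumption, not copied from Spark's source):

// Sketch of Spark's range partitioning for JDBC reads, as I understand
// JDBCRelation.columnPartition; the clause formats are an assumption.
object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val column        = "rownum"
    val lowerBound    = 0L
    val upperBound    = 22000L
    val numPartitions = 3

    // Width of each partition's value range: 22000 / 3 = 7333
    val stride  = upperBound / numPartitions - lowerBound / numPartitions
    var current = lowerBound
    for (i <- 0 until numPartitions) {
      val lower = if (i != 0) Some(s"$column >= $current") else None
      current += stride
      val upper = if (i != numPartitions - 1) Some(s"$column < $current") else None
      val where = (lower, upper) match {
        case (None, Some(u))    => s"$u or $column is null" // first partition also takes NULLs
        case (Some(l), Some(u)) => s"$l AND $u"
        case (Some(l), None)    => l                        // last partition is open-ended
        case _                  => ""
      }
      println(s"partition $i: WHERE $where")
    }
  }
}

This prints rownum < 7333 or rownum is null, then rownum >= 7333 AND rownum < 14666, then rownum >= 14666, which is the three-way split I was counting on.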
Ideally, then, it should return 3 partitions with roughly 7000 rows each. But when I ran a count on each partition of the DataFrame, I can see that only one partition has rows while the others have 0:
df.rdd.mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }.toDF().show()
Output:
+---+----+
| _1|  _2|
+---+----+
|  0|7332|
|  1|   0|
|  2|   0|
+---+----+
Can you please suggest why it is only returning rows in one partition? (The 7332 rows in partition 0 do match the first clause above, rownum < 7333, while the other two partitions come back empty.)
My source is an Oracle database, and I am using the Oracle JDBC driver
oracle.jdbc.driver.OracleDriver (jar: ojdbc7.jar).
Reference thread: http://apache-spark-user-list.1001560.n3.nabble.com/Question-on-using-pseudo-columns-in-spark-jdbc-options-td30154.html