I want to use Spark to read all records from an Oracle table.
Assume this table contains a total of 10,000,000 records.
Is the following optimization feasible?
```scala
val table = spark.read
  .format("jdbc")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("url", "jdbc:oracle:thin:@ip:1521:dbname")
  .option("user", "")
  .option("password", "")
  // Attach ROWNUM as a synthetic column to partition on.
  .option("dbtable", "(select a.*, ROWNUM rownum__rn from tbname a) b")
  .option("fetchsize", 100000)
  .option("partitionColumn", "rownum__rn")
  .option("lowerBound", 0)
  .option("upperBound", 10000000)
  .option("numPartitions", 10)
  .load()
  // The synthetic column is no longer needed after the read.
  .drop("rownum__rn")
```
I want to know whether the DataFrame produced by the code above corresponds one-to-one with the records in the table, that is, whether there are no duplicate and no missing rows.
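For context, my understanding is that Spark 2.3 turns these options into 10 range predicates on `rownum__rn`, with the first and last partitions left open-ended, so no value of the column can fall outside every partition. A sketch of that derivation (the helper below uses my own names, not Spark API):

```scala
// Sketch of how I believe Spark 2.3 derives the per-partition WHERE
// clauses from lowerBound/upperBound/numPartitions:
// stride = (10000000 - 0) / 10 = 1,000,000.
val lower = 0L
val upper = 10000000L
val parts = 10
val stride = (upper - lower) / parts

val clauses = (0 until parts).map { i =>
  val lo = lower + i * stride
  val hi = lo + stride
  if (i == 0) s"rownum__rn < $hi OR rownum__rn IS NULL"  // first partition: open below
  else if (i == parts - 1) s"rownum__rn >= $lo"          // last partition: open above
  else s"rownum__rn >= $lo AND rownum__rn < $hi"
}
clauses.foreach(println)
```

If that is right, then duplication or omission could only come from ROWNUM itself changing between the 10 physical queries Spark issues, which is exactly what I am unsure about.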
If the above optimization is feasible, does that imply that executing the following statement multiple times will return the rows in the same order each time?
```sql
select a.*, ROWNUM rownum__rn from tbname a
```
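In case ROWNUM is not stable across executions (Oracle assigns it at fetch time and guarantees no row order without an ORDER BY), a fallback I am considering is to partition on a deterministic expression instead, via the `predicates` overload of `spark.read.jdbc`. A minimal sketch, assuming the table has a numeric primary key column `id` (hypothetical name):

```scala
import java.util.Properties

// Hypothetical alternative: 10 disjoint, deterministic buckets via
// ORA_HASH(id, 9), which returns a value in 0..9 for each row.
val numPartitions = 10
val predicates = (0 until numPartitions)
  .map(i => s"ORA_HASH(id, ${numPartitions - 1}) = $i")
  .toArray

val props = new Properties()
props.setProperty("driver", "oracle.jdbc.driver.OracleDriver")
props.setProperty("user", "")
props.setProperty("password", "")
props.setProperty("fetchsize", "100000")

val table = spark.read.jdbc(
  "jdbc:oracle:thin:@ip:1521:dbname",
  "tbname",
  predicates,
  props)
```

Since each bucket is a pure function of the row's key rather than of fetch order, I would expect duplicates and omissions to be impossible here regardless of row ordering, but I would like this confirmed as well.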
Versions:
- Oracle release 11.2.0.4.0
- Spark 2.3.0