
I want to use Spark to read all records from an Oracle table.

Assume the table contains a total of 10,000,000 records.

Is the following optimization feasible?

val table = spark.read
  .format("jdbc")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("url", "jdbc:oracle:thin:@ip:1521:dbname")
  .option("user", "")
  .option("password", "")
  .option("dbtable", s"(select a.*, ROWNUM rownum__rn from tbname a) b")
  .option("fetchsize", 100000)
  .option("partitionColumn", "rownum__rn")
  .option("lowerBound", 0)
  .option("upperBound", 10000000)
  .option("numPartitions", 10)
  .load()
  .drop("rownum__rn")

I want to know whether the DataFrame obtained by the above code corresponds one-to-one with the records in the table, that is, whether there is no duplication and no omission.

If the above optimization is feasible, does it mean that executing the following statement multiple times will return the data in the same order?

select a.*, ROWNUM rownum__rn from tbname a

Versions:

  • Oracle release 11.2.0.4.0
  • Spark 2.3.0
xuejianbest

2 Answers


No, you can't rely on that assumption: a query without an ORDER BY might return rows in the same order, but it might just as well (and, with 10 million rows, probably will) not. And since Spark issues one separate query per partition, each of which re-executes the inner select and gets its own ROWNUM assignment, the same row can end up in two partitions while another row appears in none.

Either include an ORDER BY clause, or - if you're interested in some "row number" - select it using

select row_number() over (order by deptno, hiredate desc) rn, 
  dname, loc, ename, job, sal
from your_table

(the column names here are arbitrary; the point is how to generate a deterministic row number).
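If the goal is the partitioned JDBC read from the question, the same idea carries over: derive the partition column from a deterministic row_number() instead of ROWNUM. A minimal sketch, assuming the table has a unique, not-null key column id that defines a stable order (the column name id is an assumption for illustration):

// Deterministic partition column: row_number() over a unique key,
// so every partition query assigns the same number to the same row.
// "id" is assumed to be a unique, not-null column of tbname.
val query =
  """(select a.*, row_number() over (order by a.id) rn
    |   from tbname a) b""".stripMargin

val table = spark.read
  .format("jdbc")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("url", "jdbc:oracle:thin:@ip:1521:dbname")
  .option("user", "")
  .option("password", "")
  .option("dbtable", query)
  .option("fetchsize", 100000)
  .option("partitionColumn", "rn")
  .option("lowerBound", 1)          // row_number() starts at 1
  .option("upperBound", 10000000)
  .option("numPartitions", 10)
  .load()
  .drop("rn")

Because row_number() is computed over an explicit ORDER BY on a unique key, each of the 10 partition queries assigns the same number to the same row, so no row is duplicated or skipped.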

Also, fetching 10 million rows - what do you plan to do with that many rows? You probably won't be showing them to an end user, will you? Just being curious.

Littlefoot

10 million rows - why is it important to maintain the order of so many rows? I don't think any application will ever implement logic that needs all 10 million rows in a specific order. Spark reads data from the partitions present in the underlying data storage. The read operation in Spark is a parallel operation: as soon as one partition is materialized in memory, the application's processing logic can start on it rather than waiting for the whole load to finish. This is an asynchronous load of data into memory.

Once the data is loaded into memory, you can also consider using the coalesce or repartition functions to control the number of partitions Spark keeps in memory, as sketched below.
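A minimal sketch of both, assuming table is the DataFrame produced by the question's JDBC read (the partition counts are arbitrary illustrations):

// Redistribute into more partitions for downstream parallelism (triggers a full shuffle)
val moreParallel = table.repartition(40)

// Or reduce the number of partitions without a full shuffle,
// e.g. before writing a smaller number of output files
val fewerFiles = table.coalesce(4)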

Prashant