
I want to read data from a database using Spark's JDBC API. I will be using 200 executors to read the data.

My question is: if I provide 200 executors, will Spark open 200 connections to the centralized (JDBC) database, or will the driver fetch all the data over a single connection?


1 Answer


When you establish connectivity with `spark.read.jdbc(...)` you can specify the `numPartitions` parameter. It sets an upper bound on how many parallel JDBC connections can be created. The Spark documentation describes it as:

> The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.

However, by default the data is read into a single partition, which usually doesn't fully utilize your SQL database.
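For illustration, a minimal Scala sketch of a partitioned read; the JDBC URL, table, and column names are placeholders, not from the question. Note that for reads, `numPartitions` only takes effect when `partitionColumn`, `lowerBound`, and `upperBound` are also supplied:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Partitioned JDBC read: Spark splits the query into numPartitions
// range predicates over partitionColumn, and each partition is fetched
// by an executor over its own JDBC connection.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
  .option("dbtable", "orders")                          // placeholder table
  .option("user", "reader")
  .option("password", "secret")
  .option("partitionColumn", "order_id") // numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "200") // at most 200 concurrent connections
  .load()
```

With 200 partitions and 200 executors, each executor can hold one partition's connection, so the read is spread across the cluster rather than funneled through the driver.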

vvg
    In isolation `numPartitions` has no effect. It is used only if combined with other properties. – Alper t. Turker May 17 '18 at 14:31
  • @user8371915 of course, but nevertheless this parameter controls parallelism. – vvg May 17 '18 at 15:22
  • If and only if it is combined with the bounds and a partition column, and the `predicates` argument is not provided; otherwise it is ignored. – Alper t. Turker May 17 '18 at 16:22
  • It works for *limiting* the number of connections used for writing though. Then `numPartitions` must be specified as an option to the `DataFrameWriter` (like `dataset.write().option("numPartitions", 50)`), not to the `DataFrameReader`. Then it will limit the number of connections used for writing with a "coalesce" operation. – ddekany Oct 12 '18 at 19:25
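A sketch of the write-side cap described in the last comment, reusing the same placeholder connection details; Spark coalesces the DataFrame down to `numPartitions` before writing, so at most that many connections are opened:

```scala
// Cap concurrent JDBC connections on write: Spark calls
// coalesce(50) on the DataFrame before writing, so no more than
// 50 connections hit the database at once.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder URL
  .option("dbtable", "orders_copy")                     // placeholder table
  .option("user", "writer")
  .option("password", "secret")
  .option("numPartitions", "50")
  .mode("append")
  .save()
```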