I have the following task:
- load data from the same table across multiple schemas
- use PySpark
- use a single user that has access to all schemas in the DB
I am using the following code (more or less):
    def connect_to_oracle_db(self, db_query):
        # Returns a configured DataFrameReader; the caller invokes .load()
        return self.spark_session.read \
            .format("jdbc") \
            .option("url", "jdbc:oracle:thin:@//<host>:<port>/<service_name>") \
            .option("user", "<user>") \
            .option("password", "<pass>") \
            .option("dbtable", db_query) \
            .option("driver", "oracle.jdbc.driver.OracleDriver")

    def run(self):
        all_schemes = <list of all available schemes>
        for str_schema in all_schemes:
            db_query = "(SELECT * FROM " + str_schema + ".TABLE1) TABLE1_DATA"
            df_table1 = self.connect_to_oracle_db(db_query).load()
            # process df_table1
There are around 300 schemas, and it is quite slow because every loop iteration opens and closes a new connection. I want to find a way to reuse the existing connection, or somehow create a connection pool; the current approach looks quite inefficient to me.

Do you have any idea how to reuse the connection or create a connection pool for PySpark?
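One workaround I have been sketching (untested; the SRC_SCHEMA column and the TABLE1_ALL alias are names I made up for illustration) is to collapse the loop into a single pushdown query using UNION ALL, so Spark performs one JDBC read instead of ~300 sequential ones:

    # Untested sketch: build one pushdown query that unions TABLE1 across
    # all schemas. SRC_SCHEMA and TABLE1_ALL are illustrative names;
    # all_schemes is the same list as in run() above.
    union_sql = " UNION ALL ".join(
        "SELECT '{0}' AS SRC_SCHEMA, t.* FROM {0}.TABLE1 t".format(s)
        for s in all_schemes
    )
    df_all = self.connect_to_oracle_db("(" + union_sql + ") TABLE1_ALL").load()
    # process df_all, e.g. grouped by SRC_SCHEMA

Would that be a reasonable direction, or is there a proper way to share a connection or a pool across the individual reads?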