My SQL DB has tables with millions of records, some of them with billions, and my main SELECT is around 4,000 lines of code, but its structure is something like this:
SELECT A.seq field1, field2, field3, field4,
(select field from tableX X... where A.seq = X.seq ...) field5,
(select field from tableY Y... where A.seq = Y.seq ...) field6,
(select field from tableN Z... where A.seq = Z.seq ...) field7,
field8, field9
FROM tableA A, tableB B, tableN N
WHERE A.seq = B.seq
AND A.req_seq = N.req_seq;
My idea is to do something like this:
# load the tables into the cluster separately
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("MyApp")
sc = SparkContext(master="local[*]", conf=conf)
sql = HiveContext(sc)
dataframeA = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableA)\
    .option("user", "myuser")\
    .option("password", "mypass").load()
dataframeB = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableB)\
    .option("user", "myuser")\
    .option("password", "mypass").load()
dataframeC = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableC)\
    .option("user", "myuser")\
    .option("password", "mypass").load()
# then do the needed joins
df_aux = dataframeA.join(dataframeB, dataframeA.seq == dataframeB.seq)
df_res_aux = df_aux.join(dataframeC, df_aux.req_seq == dataframeC.req_seq)
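If it is easier to translate pieces of the existing 4,000-line statement directly, the same joins can also be written as Spark SQL over registered temporary tables. A sketch using the Spark 1.x-style registerTempTable (on Spark 2.x+ the equivalent is createOrReplaceTempView):
# register the loaded dataframes so they can be queried with SQL
dataframeA.registerTempTable("tableA")
dataframeB.registerTempTable("tableB")
dataframeC.registerTempTable("tableC")

# in practice, list the needed columns as in the original SELECT
df_res_aux = sql.sql("""
    SELECT A.*, B.*, C.*
    FROM tableA A
    JOIN tableB B ON A.seq = B.seq
    JOIN tableC C ON A.req_seq = C.req_seq
""")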
# then with that dataframe calculate the subselect fields
def calculate_field5(seq):
    # load the table in the cluster as with the main tables
    # and query the dataframe,
    # or run the query against the DB and return the field
    return field
df_res = df_res_aux.withColumn('field5', calculate_field5(df_res_aux.seq))
# the same for the rest of fields
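As a concrete sketch of the first option described in calculate_field5 (load the table in the cluster and query the dataframe), the correlated subselect for field5 can be expressed as a join instead of a per-row call; tableX, the column names and the left join are assumptions based on the subselect shown above:
# hypothetical: load tableX like the main tables and join it back on seq
dataframeX = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", "tableX")\
    .option("user", "myuser")\
    .option("password", "mypass").load()\
    .select("seq", "field")\
    .withColumnRenamed("field", "field5")

# left join keeps rows with no match, where the scalar subquery would return NULL
df_res = df_res_aux.join(dataframeX,
                         df_res_aux.seq == dataframeX.seq,
                         "left_outer")
If the original subselect filters or aggregates, the same filter or groupBy would be applied to dataframeX before the join.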
Is that a good way to do it? Should I approach it in a different way?
Any advice would be really, really appreciated.