
My MySQL database has tables with millions of records, some of them with billions, and my main SELECT is around 4000 lines of code, but the structure is something like this:

SELECT A.seq field1, field2, field3, field4,
       (select field from tableX X... where A.seq = X.seq ...) field5,
       (select field from tableY Y... where A.seq = Y.seq ...) field6,
       (select field from tableN Z... where A.seq = Z.seq ...) field7,
       field8, field9
  FROM tableA A, tableB B, tableN N
 WHERE A.seq = B.seq
   AND A.req_seq = N.req_seq;

My idea is to do something like this:

# load the tables in the cluster separately

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("MyApp")
sc = SparkContext(master="local[*]", conf=conf)
sql = HiveContext(sc)

dataframeA = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableA)\
    .option("user", "myuser")\
    .option("password", "mypass").load()

dataframeB = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableB)\
    .option("user", "myuser")\
    .option("password", "mypass").load()

dataframeC = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableN)\
    .option("user", "myuser")\
    .option("password", "mypass").load()

# then do the needed joins

df_aux = dataframeA.join(dataframeB, dataframeA.seq == dataframeB.seq)

df_res_aux = df_aux.join(dataframeC, df_aux.req_seq == dataframeC.req_seq)


# then with that dataframe calculate the subselect fields

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def calculate_field5(seq):
    # load the table in the cluster as with the main tables
    # and query the dataframe,
    # or run the query against the DB and return the field
    return field

calculate_field5_udf = udf(calculate_field5, StringType())

df_res = df_res_aux.withColumn('field5', calculate_field5_udf(df_res_aux.seq))
# the same for the rest of the fields
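
Alternatively, the "load the table in the cluster and query the dataframe" idea can be expressed as a join instead of a per-row function. A minimal sketch for field5, assuming tableX is loaded the same way as the main tables and the looked-up column is called field (both names are just the placeholders from the query above):

dataframeX = sql.read.format("jdbc")\
    .option("url", "db_url")\
    .option("driver", "myDriver")\
    .option("dbtable", tableX)\
    .option("user", "myuser")\
    .option("password", "mypass").load()

# keep only the join key and the looked-up value, renamed to field5
df_x = dataframeX.select(dataframeX.seq.alias('x_seq'),
                         dataframeX.field.alias('field5'))

# left join so rows without a match in tableX keep a null field5,
# like the correlated subselect would
df_res = df_res_aux.join(df_x, df_res_aux.seq == df_x.x_seq, 'left_outer')\
                   .drop('x_seq')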

Is this a good approach, or should I do it in a different way?

Any advice will be really, really appreciated

Alezis

1 Answer


Well,

If you want to use MySQL as the source in your job, this is the way to do it.

But note that the job may take a long time to run because of the MySQL query time. MySQL is not a distributed database, so retrieving the data from it can take a lot of time.

Here is what I suggest:

Try to copy your data to HDFS (if you have an HDFS cluster) and use Sqoop for that. Here is one example of how to use it in an incremental way.

Try to convert the stored data to ORC. See the example here.
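
For example, if Sqoop imported the data into Hive tables (Sqoop has a --hive-import option), the conversion to ORC can be done from the same HiveContext (sql in the question's code); a minimal sketch with hypothetical table names:

# create ORC copies of the imported tables (names are hypothetical)
for t in ["tableA", "tableB", "tableN"]:
    sql.sql("CREATE TABLE {0}_orc STORED AS ORC "
            "AS SELECT * FROM {0}_import".format(t))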

The point of this suggestion is to take load off your database. Every time you request the data directly from MySQL, you are using MySQL's resources to send the data to Spark. With the approach I suggest, you copy your DB to HDFS once and let Spark process the data from there, so the processing does not hit your DB at all.
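
Once the data is in HDFS as ORC, the joins from the question can run against those tables instead of JDBC; a minimal sketch, reusing the hypothetical table names above:

dataframeA = sql.table("tableA_orc")
dataframeB = sql.table("tableB_orc")
dataframeC = sql.table("tableN_orc")

df_aux = dataframeA.join(dataframeB, dataframeA.seq == dataframeB.seq)
df_res_aux = df_aux.join(dataframeC, df_aux.req_seq == dataframeC.req_seq)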

Why use ORC? ORC stores your data in a compact, columnar format, which speeds up retrieval and scans.

Thiago Baldim
  • Thanks for the answer! I will take a look at those technologies. So, is it preferable to retrieve all the needed tables to a filesystem or in memory and then apply the filters? – Alezis Apr 10 '17 at 14:44