I have a bunch of MySQL tables that I need to perform some analysis on. I have exported the tables as CSV files and put them on HDFS. I currently read each table from HDFS into its own DataFrame in PySpark and run the analysis there:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc comes from the pyspark shell

# Read one exported CSV from HDFS into a DataFrame via the spark-csv package
df = (sqlContext.read.format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .load('hdfs://path/to/file.csv'))
Today I learned that you can read the tables directly from MySQL into Spark. Are there any performance benefits to doing it that way? What is the standard procedure to follow when working with huge RDBMS tables in Spark?
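From what I have read so far, the direct read would go through Spark's JDBC data source, something like the sketch below. The connection URL, table name, and credentials are placeholders, and I assume the MySQL Connector/J JAR has to be on the classpath (e.g. passed with --jars); I have not tested this yet:

# Hypothetical JDBC read; url, dbtable, user and password are placeholders
df = (sqlContext.read.format('jdbc')
      .options(url='jdbc:mysql://hostname:3306/dbname',
               driver='com.mysql.jdbc.Driver',
               dbtable='my_table',
               user='username',
               password='password')
      .load())

I have also seen that the JDBC reader accepts partitionColumn, lowerBound, upperBound and numPartitions options for parallel reads, but I have not tried them, so I am unsure whether that is what people normally use for large tables.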