
I have a bunch of MySQL tables that I need to perform some analysis on. I have currently exported the tables as CSV files and put them on HDFS. I now read each of the tables from HDFS into a separate DataFrame in PySpark to do the analysis:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferSchema='true') \
        .load('hdfs://path/to/file.csv')

Today I learned that you can read the tables directly from MySQL into Spark. Are there any performance benefits to doing it this way? What is the standard procedure to follow when working with huge RDBMS tables in Spark?

2 Answers


A direct JDBC connection gives you additional options such as partitionColumn, lowerBound, upperBound, and numPartitions, which let Spark read the table in parallel:

df = sqlContext.read.format('jdbc').options(url='jdbc:postgresql:dbserver', dbtable='schema.tablename').load()
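
Here is a rough sketch of a partitioned read using those options; the MySQL URL, credentials, and the numeric id column below are placeholders, not details from the original setup:

# Spark issues numPartitions parallel queries, splitting the id range between
# lowerBound and upperBound across partitions (the bounds do not filter rows,
# they only control how the column range is divided).
df = sqlContext.read.format('jdbc').options(
        url='jdbc:mysql://dbserver:3306/mydb',
        driver='com.mysql.jdbc.Driver',
        user='someUser',
        password='somePassword',
        dbtable='schema.tablename',
        partitionColumn='id',
        lowerBound='1',
        upperBound='1000000',
        numPartitions='10'
    ).load()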

One more important feature is that you can filter your data with a query, by passing a subquery as the dbtable option:

df = sqlContext.read.format('jdbc').options(url='jdbc:postgresql:dbserver', dbtable='(select id,name from emp) as emp').load()
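
Since the question is about MySQL, the same pushdown pattern would look roughly like this (server, database, and credentials are made-up placeholders):

# Only the rows and columns selected by the subquery are transferred from
# MySQL to Spark; the filter runs on the database side.
df = sqlContext.read.format('jdbc').options(
        url='jdbc:mysql://dbserver:3306/mydb',
        driver='com.mysql.jdbc.Driver',
        user='someUser',
        password='somePassword',
        dbtable='(select id, name from emp where dept = 10) as emp'
    ).load()
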
Kaushal

You can do this:

props = {"driver": "com.vertica.jdbc.Driver"}
df = spark.read.jdbc(
        url="jdbc:vertica://server:PORT/database?user=someUser&password=Password1",
        table="(select * from x where y=0) someAliasThatDontMatter",
        properties=props
    )

The aliased subquery passed as the table argument lets you filter, join, etc. on the database side before the data reaches Spark.
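
For example, a join can be pushed down the same way; the table and column names here are invented for illustration:

# The database executes the join, and Spark only receives the result set
# named by the alias.
joined_df = spark.read.jdbc(
        url="jdbc:vertica://server:PORT/database?user=someUser&password=Password1",
        table="(select o.id, o.amount, c.name "
              "from orders o join customers c on o.customer_id = c.id "
              "where o.amount > 100) pushed_down_join",
        properties=props
    )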

ThatDataGuy