Read Query from MariaDB using PySpark

Question

I am trying to read a results of query from MariaDB to pyspark dataframe. The jar I have used is

--jars mariadb-java-client-2.2.2.jar

I am able to read a table using

df = spark.read.format("jdbc")\
        .option("url","jdbc:mariadb://xxx.xxx.xx.xx:xxxx/hdpms")\
        .option("driver", "org.mariadb.jdbc.Driver")\
        .option("dbtable", Mytable)\
        .option("user", "xxxxx_xxxxx")\
        .option("password", "xxxxx")\
        .load()

Now I am looking for a command to just run a simple query such as

SELECT col1,col2,col3,.. From MyTable Where date>2019 and cond2;

Though I am able to run it using giving the query as

"MyTable date>2019 and cond2 --"

as jar adds SELECT * FROM at the beginning and where 1=0 at the end but I am facing following error

    py4j.protocol.Py4JJavaError: An error occurred while calling o455.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 12, xhadoopm3095p.aetna.com, executor 2): java.sql.SQLException: Value "DATE_CREATED" cannot be parse as Timestamp
        at org.mariadb.jdbc.internal.com.read.resultset.rowprotocol.TextRowProtocol.getInternalTimestamp(TextRowProtocol.java:592)
        at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getTimestamp(SelectResultSet.java:1178)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$11.apply(JdbcUtils.scala:439)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$11.apply(JdbcUtils.scala:438)

can anyone please help me with this. Thank you

Does this answer your question? [In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?](https://stackoverflow.com/questions/38729436/in-apache-spark-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database) — user10938362, Jan 28 '20 at 01:55

score 1 · Accepted Answer · answered Jan 28 '20 at 09:27

df = spark.read.format("jdbc")\
        .option("url","jdbc:mariadb://xxx.xxx.xx.xx:xxxx/hdpms")\
        .option("driver", "org.mariadb.jdbc.Driver")\
        .option("dbtable", "(SELECT col1,col2,col3,.. From MyTable Where date>2019 and cond2) tmp")\
        .option("user", "xxxxx_xxxxx")\
        .option("password", "xxxxx")\
        .load()

create a alias for table using query and it will work

Read Query from MariaDB using PySpark

1 Answers1