I need to read data from a Postgres database in Spark. Because the SQL queries are quite long (close to 300 lines), I would like to read each query from a .sql file instead of hard-coding it in the Scala source.
However, I am having trouble reading a SQL file in a way that keeps the query valid. The query runs fine in DBeaver, but from Scala it throws an org.postgresql.util.PSQLException. A typical stack trace points to a seemingly random position in the SQL file and gives no further hints.
I am reading SQL files like this:
val query = sc.textFile("someQuery.sql").collect().mkString("\n")
This gives a string that looks like the original SQL query. The Postgres exception is then thrown when I run
spark.read.jdbc(DBUrl, query, connectionProperties)
Edit:
After modifying it a bit, the solution given here ended up working for me.
Some notes for anyone it may concern:
- always alias your query, i.e. not "select * from tbl" but "(select * from tbl) tmp" (Spark substitutes the dbtable value into a "select * from <dbtable>" statement, so a bare query is not valid there);
- use SparkSession instead of creating a SQLContext from a SparkContext (a sketch combining both notes follows this list).
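Putting both notes together, here is a minimal sketch, assuming the .sql file lives on the driver's local filesystem (the file name and app name below are just placeholders):

import scala.io.Source
import org.apache.spark.sql.SparkSession

// One SparkSession per application (this replaces the SQLContext/SparkContext pair).
val spark = SparkSession.builder()
  .appName("postgres-query") // placeholder app name
  .getOrCreate()

// Read the query from a local file on the driver.
val source = Source.fromFile("someQuery.sql")
val rawQuery = try source.mkString finally source.close()

// Spark wraps the dbtable value in "select * from <dbtable>",
// so the query must be wrapped and aliased as a derived table.
val aliasedQuery = s"($rawQuery) tmp"

Without the tmp alias, Postgres sees an unaliased subquery inside the select that Spark generates and rejects it, which is presumably the PSQLException I was seeing.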
This is the snippet I ended up with:
val df = spark
  .read.format("jdbc")
  .options(Map(
    "url" -> jdbcUrl,
    "dbtable" -> aliasedQuery, // the aliased "(...) tmp" form, not the raw query
    "driver" -> "org.postgresql.Driver", // JDBC driver class; the conventional option key is lowercase "driver"
    "user" -> <user>,
    "password" -> <password>))
  .load()
(I know, don't hardcode passwords.)
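One way around that (just a sketch; the environment variable names are made up, so adjust them to wherever your deployment keeps credentials) is to keep them out of the source and pass them through connection properties:

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.setProperty("driver", "org.postgresql.Driver")
// Hypothetical environment variable names.
connectionProperties.setProperty("user", sys.env("POSTGRES_USER"))
connectionProperties.setProperty("password", sys.env("POSTGRES_PASSWORD"))

val df = spark.read.jdbc(jdbcUrl, aliasedQuery, connectionProperties)

This performs the same read as the options-based snippet above, just via the jdbc overload that takes a Properties object.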