
I want to execute the following query on a remote Postgres server from a PySpark application using the JDBC connector:

SELECT id, postgres_function(some_column) FROM my_database GROUP BY id

The problem is that I can't execute this kind of query from PySpark using spark.sql(QUERY), obviously because postgres_function is not one of the ANSI SQL functions supported since Spark 2.0.0.

I'm using Spark 2.0.1 and Postgres 9.4.


1 Answer


The only option you have is to use a subquery:

table = """
  (SELECT id, postgres_function(some_column) FROM my_database GROUP BY id) AS t
"""
sqlContext.read.jdbc(url=url, table=table)

but this will execute the whole query, including the aggregation, on the database side and fetch only the result.
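
For context, a more complete, self-contained sketch could look like the one below. The host, database name, credentials, and driver class are placeholders to replace with your own, and it assumes the PostgreSQL JDBC driver jar is on the classpath.

from pyspark.sql import SparkSession

# Hypothetical connection details -- replace with your own
url = "jdbc:postgresql://db-host:5432/mydb"
properties = {
    "user": "my_user",
    "password": "my_password",
    "driver": "org.postgresql.Driver",
}

spark = SparkSession.builder.appName("postgres-pushdown").getOrCreate()

# The derived table (with its mandatory alias) is sent to Postgres verbatim,
# so postgres_function and the GROUP BY run on the database side
table = """
  (SELECT id, postgres_function(some_column) FROM my_database GROUP BY id) AS t
"""

df = spark.read.jdbc(url=url, table=table, properties=properties)
df.show()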

In general it doesn't matter whether a function is an ANSI SQL function or whether it has an equivalent in the source database; all functions called in spark.sql are executed in Spark after the data has been fetched.
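
For example, assuming the same (hypothetical) connection details as above, the snippet below reads the raw rows over JDBC and then runs a built-in Spark aggregate on them, so the grouping happens on the Spark side rather than in Postgres:

# Only the raw rows are fetched from Postgres
raw = spark.read.jdbc(url=url, table="my_database", properties=properties)
raw.createOrReplaceTempView("my_table")

# max() is Spark's built-in aggregate: the GROUP BY runs in Spark, not in the database
spark.sql("SELECT id, max(some_column) AS agg FROM my_table GROUP BY id").show()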
