I am not an expert on the Spark SQL API, nor on the underlying RDD one.

But, knowing about the Catalyst optimizer, I would expect Spark to try to minimize the in-memory work.

This is my situation: I have, let's say, two tables:

TABLE GenericOperation (ID, CommonFields...)
TABLE SpecificOperation (OperationID, SpecificFields...)

They are both quite huge (~500M, not big data, but infeasible to hold entirely in memory on a standard application server).

That said, suppose that, as part of a larger use case, I have to retrieve with Spark all the SpecificOperation rows that match some condition on fields that belong to GenericOperation.

This is the code that I am using:

val gOps = spark.read.jdbc(db.connection, "GenericOperation", db.properties)
val sOps = spark.read.jdbc(db.connection, "SpecificOperation", db.properties)
val joined = sOps.join(gOps).where("ID = OperationID")
joined.where("CommonField= 'SomeValue'").select("SpecificField").show()

The problem is that, when I run the above, I can see from SQL Server Profiler that Spark does not execute the join on the database; instead, it retrieves all the OperationID values from SpecificOperation, and I assume it then runs the merge in memory. Since no filter is applicable on SpecificOperation directly, such a retrieval would bring far too much data into the end system.
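For completeness, one can double-check this from Spark's side as well by printing the extended plan: the JDBC scans list what is pushed down (PushedFilters, read schema), while the join shows up as a Spark operator rather than as part of the pushed query.

// Prints the logical and physical plans; it does not change the behaviour.
joined.where("CommonField = 'SomeValue'").select("SpecificField").explain(true)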

Is it possible to write the above so that the join is delegated directly to the DBMS? Or does it depend on some magic Spark configuration I am not aware of?

Of course, I could simply hardcode the join as a subquery when retrieving, but that's not feasible in my case: statements have to be created at runtime starting from simple building blocks. Hence, I need to implement this starting from two spark.sql.DataFrames that have already been built up.

As a side note, I am running this with Spark 2.3.0 for Scala 2.11, against a SQL Server 2016 database instance.

  • Possible duplicate of [In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?](https://stackoverflow.com/questions/38729436/in-apache-spark-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database) – Alper t. Turker Jun 05 '18 at 17:33
  • Not quite, but thank you for the suggestion. I edited the question to disambiguate: in my case I can't act upon the logic prior to the join – Simone Colucci Jun 05 '18 at 17:52
  • I don't think it will work that way. You can push the query to the database with spark.read.jdbc(query) – loneStar Jun 05 '18 at 19:31

1 Answer


Is it possible to write the above so that the join is delegated directly to the DBMS? Or does it depend on some magic Spark configuration I am not aware of?

Excluding statically generated queries ([In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?](https://stackoverflow.com/questions/38729436/in-apache-spark-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database)), Spark doesn't support join pushdown. Only predicates and column selection can be delegated to the source.
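If a generated query is acceptable for a given step, the usual workaround is to pass the join as a subquery in place of the table name, so the database executes it and Spark only scans the result. A rough sketch, reusing db.connection and db.properties from the question (the derived-table alias is required; table and column names are only illustrative):

val pushedDown = spark.read.jdbc(
  db.connection,
  """(SELECT s.*
      FROM SpecificOperation s
      JOIN GenericOperation g ON g.ID = s.OperationID
      WHERE g.CommonField = 'SomeValue') AS joined_ops""",
  db.properties)
pushedDown.select("SpecificField").show()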

There is no magic configuration or code that would enable this type of processing.
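What the JDBC source will do for you is push down predicates and column pruning, so at least the GenericOperation side arrives pre-filtered before the join runs in Spark. A small sketch, assuming the gOps and sOps DataFrames from the question:

// The filter and column selection on gOps are pushed into the SQL sent to the
// server (visible as PushedFilters in explain); the join itself still runs in Spark.
val filteredG = gOps.where("CommonField = 'SomeValue'").select("ID")
filteredG.join(sOps, filteredG("ID") === sOps("OperationID"))
  .select("SpecificField")
  .show()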

In general, if the server can handle the join, the data is usually not large enough to benefit from Spark.

  • Thanks, that helps. Spark has been chosen for many reasons, none of them being the volume of data (huge, but not that much)... still, do you know if that is a nice-to-have they haven't implemented so far, or if there are obstacles that could prevent its development as a feature in any case? For example, I see they were working on something similar: https://www.slideshare.net/databricks/extending-apache-spark-sql-data-source-apis-with-join-push-down-with-ioana-delaney-and-jia-li – Simone Colucci Jun 05 '18 at 21:11