Spark SQL/Hive Query Takes Forever With Join

Question

So I'm doing something that should be simple, but apparently it's not in Spark SQL.

If I run the following query in MySQL, the query finishes in a fraction of a second:

SELECT ua.address_id
FROM user u
INNER JOIN user_address ua ON ua.address_id = u.user_address_id
WHERE u.user_id = 123;

However, running the same query in a HiveContext under Spark (1.5.1) takes more than 13 seconds. Adding more joins makes the query run for a very long time (over 10 minutes). I'm not sure what I'm doing wrong here or how I can speed things up.

The tables are MySQL tables that are loaded into the HiveContext as temporary tables. This is running on a single instance, with the database on a remote machine.

The user table has about 4.8 million rows.
The user_address table has 350,000 rows.

The tables have foreign key fields, but no explicit FK relationships are defined in the DB. I'm using InnoDB.

The execution plan in Spark:

== Physical Plan ==
TungstenProject [address_id#0L]
 SortMergeJoin [user_address_id#27L], [address_id#52L]
  TungstenSort [user_address_id#27L ASC], false, 0
   TungstenExchange hashpartitioning(user_address_id#27L)
    ConvertToUnsafe
     Filter (user_id#0L = 123)
      Scan JDBCRelation(jdbc:mysql://.user,[Lorg.apache.spark.Partition;@596f5dfc, {user=, password=, url=jdbc:mysql://, dbtable=user})[address_id#0L,user_address_id#27L]
  TungstenSort [address_id#52L ASC], false, 0
   TungstenExchange hashpartitioning(address_id#52L)
    ConvertToUnsafe
     Scan JDBCRelation(jdbc:mysql://.user_address,[Lorg.apache.spark.Partition;@2ce558f3,{user=, password=, url=jdbc:mysql://, dbtable=user_address})[address_id#52L]

Please add the physical plan and the effective SQL queries that are run against the DB. Also add the code that creates the data frames and the query. – Beryllium, Dec 02 '15 at 08:01

zero323 · Accepted Answer · 2015-12-02T08:13:19.087

4

First of all, the type of query you perform is extremely inefficient. As of now (Spark 1.5.0*), to perform a join like this, both tables have to be shuffled / hash-partitioned each time the query is executed. That shouldn't be a problem for the user table, where the user_id = 123 predicate is most likely pushed down, but the join still requires a full shuffle of user_address.

Moreover, if the tables are only registered and not cached, then every execution of this query will fetch the whole user_address table from MySQL into Spark.

"I'm not sure what I'm doing wrong here and how I can speed things up."

It is not exactly clear why you want to use Spark for this application, but the single-machine setup, the small data, and the type of queries suggest that Spark is not a good fit here.

Generally speaking, if the application logic requires single-record access, Spark SQL won't perform well. It is designed for analytical queries, not as an OLTP database replacement.

If one of the tables / data frames is much smaller than the other, you could try broadcasting it.

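import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast
// Needed for the $"column" syntax; assumes a SQLContext named sqlContext is in scope
import sqlContext.implicits._

val user: DataFrame = ???          // JDBC-backed user table
val user_address: DataFrame = ???  // JDBC-backed user_address table

// Apply the selective predicate first so the broadcast side stays small
val userFiltered = user.where(???)

// Broadcast the small, filtered side instead of shuffling both tables
user_address.join(
  broadcast(userFiltered), $"address_id" === $"user_address_id")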
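
To illustrate what broadcasting buys you, here is a minimal plain-Scala sketch of a broadcast-style hash join. It is not Spark's actual implementation, just the idea: the small, filtered side is indexed in memory once, and the large side is streamed past it in a single pass, with no shuffle and no sort. The case classes, the object name BroadcastJoinSketch, and the sample rows are made up for illustration.

```scala
// Hypothetical row types mirroring the join columns from the question.
case class User(userId: Long, userAddressId: Long)
case class UserAddress(addressId: Long)

object BroadcastJoinSketch {
  // Join a large side against a small side that fits in memory.
  def join(small: Seq[User], large: Iterator[UserAddress]): List[(User, UserAddress)] = {
    // Build phase: index the small side by its join key, once.
    val byKey: Map[Long, Seq[User]] = small.groupBy(_.userAddressId)
    // Probe phase: a single pass over the large side, one lookup per row.
    large.flatMap { ua =>
      byKey.getOrElse(ua.addressId, Seq.empty[User]).map(u => (u, ua))
    }.toList
  }
}
```

Calling BroadcastJoinSketch.join(Seq(User(123L, 7L)), Iterator(UserAddress(5L), UserAddress(7L), UserAddress(9L))) pairs the one user with the address whose id is 7. In Spark the large side still has to be read in full from MySQL; broadcasting only removes the shuffle and sort on top of that.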

* This should change in Spark 1.6.0 with SPARK-11410 which should enable persistent table partitioning.

edited Dec 02 '15 at 08:13

answered Dec 02 '15 at 07:35


Just to clarify, the example above is just to demonstrate that this happens with a single record. Not limiting the query to a single record makes it much worse. What I'm trying to accomplish here is to (ab)use Spark as a unified data source for SQL + files. I tried caching the tables, but that resulted in the query hanging for over 15 minutes. – Ali B Dec 02 '15 at 07:58
It doesn't change my answer. It only makes LHS of join more expensive. It still has to fetch all data and shuffle. – zero323 Dec 02 '15 at 08:00
So, would you say a more tenable approach is to run the MySQL query over Spring/Hibernate, the file query over Spark, and then join the two in Spark? – Ali B Dec 02 '15 at 08:02
I wouldn't use Spark in the first place :) but if you don't use MySQL for on-line queries then you can push down the whole join as a subquery: http://stackoverflow.com/a/32585936/1560062 – zero323 Dec 02 '15 at 08:04
Well, not using Spark makes sense for the DB, but for files that need to be loaded and queried in an ad-hoc manner, Spark seems like the perfect solution. – Ali B Dec 02 '15 at 08:07
In a single machine setup? Not really... You can take a look at the different SQL/MED (for example PostgreSQL FDW) though. In a distributed setup I would actually consider Apache Drill first. – zero323 Dec 02 '15 at 08:17

score 3 · Answer 2 · answered Dec 02 '15 at 08:20

I have had the same problem in a similar situation (Spark 1.5.1, PostgreSQL 9.4).

Given two tables like

val t1 = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql:db",
    "dbtable" -> "t1")).load()

val t2 = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql:db",
    "dbtable" -> "t2")).load()

then a join in HiveQL over the registered temporary tables results in a full table scan of one of the tables (in my case, the child table).

Anyway, a workaround is to push the whole query down to the underlying RDBMS:

val joined = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:postgresql:db",
    "dbtable" -> "(select t1.*, t2.* from t1 inner join t2 on ...) as t")).load()

This way the query optimizer of the underlying RDBMS kicks in, and in my case it switched to index scans. Spark, on the other hand, pushed down two independent queries, and an RDBMS can't optimize those as a single join.
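
Applied to the query from the question, the same workaround might look like the sketch below. It only builds the options map, so it runs without Spark; the actual load call is shown as a comment and assumes a sqlContext is in scope. The helper name JdbcPushdown, the alias t, and the JDBC URL are placeholders, not taken from the question.

```scala
object JdbcPushdown {
  // The "dbtable" option accepts a table name or a parenthesized,
  // aliased subquery, so wrap the full query accordingly.
  def options(url: String, query: String, alias: String = "t"): Map[String, String] =
    Map("url" -> url, "dbtable" -> s"(${query.trim}) AS $alias")

  // The join from the question, to be executed entirely by MySQL.
  val joinQuery: String =
    """SELECT ua.address_id
      |FROM user u
      |INNER JOIN user_address ua ON ua.address_id = u.user_address_id
      |WHERE u.user_id = 123""".stripMargin

  // Hypothetical URL; replace with the real connection string.
  val exampleOptions: Map[String, String] =
    options("jdbc:mysql://host/db", joinQuery)
}

// With a SQLContext in scope (not created here):
// val joined = sqlContext.read.format("jdbc")
//   .options(JdbcPushdown.exampleOptions).load()
```

Spark then sees a single JDBC relation, so the join and the user_id filter are both evaluated by the database, which can use its indexes.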
