
My current job is to create ETL processes with Spark SQL/Scala, using Spark 2.2 with Hive support (all tables live in the Hive warehouse on HDFS).

One specific process requires joining a table of 1b unique records with another of 5b unique records.

The join key is skewed, in the sense that some keys are repeated far more often than others, but our Hive is not configured to skew tables by that field, nor is it possible to implement that in the current scenario.

Currently I read each table into its own dataframe and join the two. I tried both an inner join and a right outer join on the 5b table to see if there was any performance gain (I'd drop the rows with null records afterwards). I could not notice one, but that could be down to cluster instability (I'm not sure whether a right join requires less shuffling than an inner one).
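For reference, the two attempts look roughly like this. Table names (`db.table_1b`, `db.table_5b`) and column names (`join_key`, `col_from_1b`) are placeholders for the real ones:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder table names -- substitute the real Hive tables.
def readTables(spark: SparkSession): (DataFrame, DataFrame) = {
  val small = spark.table("db.table_1b") // ~1b unique records
  val big   = spark.table("db.table_5b") // ~5b unique records
  (small, big)
}

// Attempt 1: plain inner join on the skewed key
def innerJoin(small: DataFrame, big: DataFrame): DataFrame =
  small.join(big, Seq("join_key"), "inner")

// Attempt 2: right outer join on the 5b side, dropping the rows
// that got nulls from the 1b side afterwards
def rightOuterJoin(small: DataFrame, big: DataFrame): DataFrame =
  small.join(big, Seq("join_key"), "right_outer")
    .na.drop("any", Seq("col_from_1b")) // placeholder 1b-side column
```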

I have tried filtering the 5b table down to the keys present in the 1b table by creating a temp view and adding a WHERE clause to the SELECT statement over the 5b table, but still couldn't notice any performance gain (note: it's not possible to collect the unique keys from the 1b table, since that triggers a memory exception on the driver). I have also tried doing the entire thing in a single SQL query, but again, no luck.
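What I mean by the filtering attempt, expressed as a left semi join so the key set never has to be collected to the driver (names are again placeholders):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Restrict the 5b table to keys that exist in the 1b table, without
// collecting those keys to the driver (which OOMs): a LEFT SEMI JOIN
// keeps the whole filter distributed.
def filterBigByKeys(spark: SparkSession, small: DataFrame, big: DataFrame): DataFrame = {
  small.select("join_key").createOrReplaceTempView("small_keys")
  big.createOrReplaceTempView("big_tbl")
  spark.sql("""
    SELECT b.*
    FROM big_tbl b
    LEFT SEMI JOIN small_keys s
      ON b.join_key = s.join_key
  """)
}
```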

I've read some people talking about creating a PairRDD and calling partitionBy with a HashPartitioner, but this seems outdated since the introduction of dataframes. Right now, I'm looking for some solid guidance on dealing with this join of two very large datasets.

Edit: there's an answer here that deals with pretty much the same problem that I have, but it's 2 years old. It simply tells me to first join a broadcast set of the records corresponding to the heavily repeated keys, then perform another join with the rest of the records, and union the results. Is this still the best approach for my problem? I have skewed keys in both tables.
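Concretely, I understand the suggested approach as something like the sketch below. The hot-key threshold is arbitrary, and the hot slice of the 1b table would itself have to be small enough to broadcast, which is exactly what worries me given the skew on both sides:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Split both tables into "hot" (heavily repeated) and "cold" keys,
// broadcast-join the hot slice, shuffle-join the rest, then union.
def skewAwareJoin(small: DataFrame, big: DataFrame, hotThreshold: Long): DataFrame = {
  // 1. Heavy-hitter keys on the 5b side (threshold is a guess)
  val hotKeys = big.groupBy("join_key").count()
    .filter(col("count") > hotThreshold)
    .select("join_key")

  // 2. Split both tables on hot vs. cold keys
  val bigHot    = big.join(broadcast(hotKeys), Seq("join_key"), "left_semi")
  val bigCold   = big.join(broadcast(hotKeys), Seq("join_key"), "left_anti")
  val smallHot  = small.join(broadcast(hotKeys), Seq("join_key"), "left_semi")
  val smallCold = small.join(broadcast(hotKeys), Seq("join_key"), "left_anti")

  // 3. Hot keys: broadcast the 1b-side slice (must fit in memory).
  //    Cold keys: ordinary shuffle join, now without the heavy hitters.
  bigHot.join(broadcast(smallHot), Seq("join_key"))
    .union(bigCold.join(smallCold, Seq("join_key")))
}
```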

Guigs
    Possible duplicate of [Spark final task takes 100x times longer than first 199, how to improve](https://stackoverflow.com/questions/38517835/spark-final-task-takes-100x-times-longer-than-first-199-how-to-improve) – Alper t. Turker Apr 26 '18 at 16:00

0 Answers