
I'm trying to figure out why a simple query against my 15 GB table generates 182 GB of shuffle spill.

First I read the table into Spark from Redshift. When I tell Spark to do a simple count on the table, it works fine. However, when I try to create a new table from it, I get all kinds of YARN failures, and ultimately some of my tasks show a shuffle spill (memory) of 182 GB.

Here is the problematic query (I've changed some of the names):

CREATE TABLE output AS
  SELECT
    *
  FROM inputs
  DISTRIBUTE BY id
  SORT BY snapshot_date
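
For context, my understanding is that the query is roughly equivalent to the DataFrame version sketched below (using the placeholder names from the query): DISTRIBUTE BY id triggers a hash shuffle on id, and SORT BY snapshot_date sorts rows within each resulting partition.

# Sketch of what I believe is the DataFrame equivalent of the CTAS above
# (table/column names are the placeholders from the query).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

inputs = spark.table("inputs")

(inputs
    .repartition("id")                      # DISTRIBUTE BY id: hash shuffle on id
    .sortWithinPartitions("snapshot_date")  # SORT BY snapshot_date: per-partition sort
    .write
    .saveAsTable("output"))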

What is going on? How could the shuffle spill exceed the total size of the input? I'm not doing a Cartesian join or anything like that; this is a super simple query.

I'm aware that Red Hat Linux (I use EMR on AWS) has virtual memory issues, since I came across that topic here, but I've added the recommended configuration `classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]` to my EMR properties and the issue persists.
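
For reference, I believe the shorthand above expands to the following JSON in the EMR cluster configuration (same setting, just written out):

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]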

Here is a screenshot from the Spark UI, if it helps:

[Spark UI screenshot]

user554481
  • Have you checked if there is skew on column ID? Try running something similar to `select id, count(*) as cnt from inputs group by id order by cnt desc limit 100` to see if there are any outliers that might be causing a huge spike of data on executor #66 – Pushkr Apr 04 '19 at 00:29
  • How do you actually know that the data in Redshift takes 15 GB? – dytyniak Apr 04 '19 at 14:10
  • @Pushkr My data is definitely skewed, but that doesn't explain why the shuffle spill would be 10x the size of the original input. @dytyniak Redshift tracks table MB size in a table called `svv_table_info`: https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_TABLE_INFO.html – user554481 Apr 04 '19 at 14:35
  • Do you mind sharing your spark-submit command? – maogautam Apr 04 '19 at 16:31
  • @PrateekPrateek I get these errors when running pyspark interactively. The command I use to start the shell is this: `pyspark --jars $JARS --driver-java-options -Dlog4j.configuration=file:///etc/alternatives/spark-conf/log4j.properties` – user554481 Apr 04 '19 at 19:39
  • You have to provide enough memory. Try `pyspark --master yarn --driver-memory 10G --num-executors 5 --executor-memory 20G --executor-cores 5`. Here master can be yarn or local; adjust the memory to your cluster. – maogautam Apr 04 '19 at 19:57

0 Answers