
I'm trying to figure out why a simple query against my 15 GB table generates 182 GB of shuffle spill.

First I read the table into Spark from Redshift. When I tell Spark to do a simple count on the table, it works fine. However, when I try to create a new table from it, I get all kinds of YARN failures, and ultimately some of my tasks show a shuffle spill (memory) of 182 GB.

Here is the problematic query (I've changed some of the names):

CREATE TABLE output AS
  SELECT
    *
  FROM inputs
  DISTRIBUTE BY id
  SORT BY snapshot_date
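
For context, my understanding is that the query is roughly equivalent to the DataFrame version sketched below (using the placeholder names from the query): DISTRIBUTE BY id triggers a hash shuffle on id, and SORT BY snapshot_date sorts rows within each resulting partition.

# Sketch of what I believe is the DataFrame equivalent of the CTAS above
# (table/column names are the placeholders from the query).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

inputs = spark.table("inputs")

(inputs
    .repartition("id")                      # DISTRIBUTE BY id: hash shuffle on id
    .sortWithinPartitions("snapshot_date")  # SORT BY snapshot_date: per-partition sort
    .write
    .saveAsTable("output"))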

What is going on? How could the shuffle spill exceed the total size of the input? I'm not doing a Cartesian join or anything like that; this is a super simple query.

I'm aware that Red Hat Linux (I use EMR on AWS) has virtual memory issues, since I came across that topic here, but I've added the recommended configuration `classification=yarn-site,properties=[yarn.nodemanager.vmem-check-enabled=false]` to my EMR properties and the issue persists.
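
For reference, I believe the shorthand above expands to the following JSON in the EMR cluster configuration (same setting, just written out):

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]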

Here is a screenshot from the Spark UI, if it helps:

[Spark UI screenshot]

user554481
  • Have you checked if there is skew on column ID? Try running something similar to `select id, count(*) as cnt from inputs group by id order by cnt desc limit 100` to see if there are any outliers that might be causing a huge spike of data on executor #66 – Pushkr Apr 04 '19 at 00:29
  • How do you actually know that the data in Redshift takes 15 GB? – dytyniak Apr 04 '19 at 14:10
  • @Pushkr My data is definitely skewed, but that doesn't explain why the shuffle spill would be 10x the size of the original input. @dytyniak Redshift tracks table MB size in a table called `svv_table_info`: https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_TABLE_INFO.html – user554481 Apr 04 '19 at 14:35
  • Do you mind sharing your spark-submit command? – maogautam Apr 04 '19 at 16:31
  • @PrateekPrateek I get these errors when running pyspark interactively. The command I use to start the shell is this: `pyspark --jars $JARS --driver-java-options -Dlog4j.configuration=file:///etc/alternatives/spark-conf/log4j.properties` – user554481 Apr 04 '19 at 19:39
  • You have to provide enough memory. Try `pyspark --master yarn --driver-memory 10G --num-executors 5 --executor-memory 20G --executor-cores 5`. Here master can be yarn or local; adjust the memory to your cluster. – maogautam Apr 04 '19 at 19:57

0 Answers