I'm having the following problem that I haven't been able to solve.
I have a very large Hive-partitioned table (several TB, around 14,000 partitions) stored in ORC format that I want to access through the PySpark SQL API. The problem is that when I run a query, every partition gets consulted, even though spark.sql.hive.metastorePartitionPruning is set to true.
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("""
SELECT col
FROM table
WHERE partition_1 = 1
AND partition_2 = 2
LIMIT 1
""")
This gets killed by the JVM with an out-of-memory error:
[Stage 0:===================================> (7072 + 152) / 10000]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 9755"...
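For reference, this is how the pruning flag is enabled on the session (a minimal sketch; I'm assuming it is passed via the builder's .config(), but the same setting could equally come from spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    # Assumption: the flag is set here; it could also be set in
    # spark-defaults.conf with the same effect.
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .enableHiveSupport()
    .getOrCreate()
)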
This is not a partition pruning problem, because the exception is raised while the table is being read, before the query is even executed:
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)
spark.read.table("table") # This causes the same error
Also, this isn't a problem when reading a smaller table, because Spark can list all of its partitions without any trouble.
Is there a way to use the Hive integration to query only the partition I need, without Spark having to scan through every possible partition?
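For context, the alternative of skipping the metastore and reading the partition's files by path would look something like the sketch below, but then I lose the Hive integration entirely (the bucket name and directory layout are illustrative, assuming the standard partition_1=1/partition_2=2 layout):

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)

# Read one partition's ORC files directly by path; the bucket and layout
# below are illustrative, not the real paths.
df = spark.read.orc("s3://some-bucket/table/partition_1=1/partition_2=2/")
df.select("col").limit(1).show()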
Update:
- Versions: Hive 2.3.6, Spark 2.4.4.
- The table is stored on S3.