
I'm having the following problem that I haven't been able to solve.

I have a very large (several TB, around 14,000 partitions) Hive-partitioned table in ORC format that I want to access through the PySpark SQL API. The problem is that when I run the query, all of the partitions are listed even though spark.sql.hive.metastorePartitionPruning is set to True.

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("""
    SELECT col
    FROM table
    WHERE partition_1 = 1
    AND partition_2 = 2
    LIMIT 1
""")

The job is interrupted by a Java out-of-memory error:

[Stage 0:===================================>              (7072 + 152) / 10000]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 9755"...

This is not a partition pruning problem, because the exception is raised while the table is merely being read, before the query is even executed:

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)
spark.read.table("table")  # This causes the same error

Also, reading a smaller table works without issue, because all of its partitions can be listed just fine.

Is there a way to use the Hive integration to query only the partition I want, without Spark having to list every possible partition?
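
For what it's worth, the only workaround I can think of (sketched below; the s3://bucket/table path is hypothetical and assumes the usual Hive-style directory layout) is to skip the metastore entirely and point the ORC reader at the partition directory, which defeats the purpose of the Hive integration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestHiveIntegration").getOrCreate()

# Hypothetical layout: s3://bucket/table/partition_1=1/partition_2=2/...
df = (
    spark.read
    .option("basePath", "s3://bucket/table")  # keep partition columns in the schema
    .orc("s3://bucket/table/partition_1=1/partition_2=2")
)
df.select("col").limit(1).show()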


Update:

  • Spark and Hive versions: Spark 2.4.4 and Hive 2.3.6.
  • The table is stored on S3.
  • can you try to add `spark = ( SparkSession .builder .appName("TestHiveIntegration") .enableHiveSupport().config("spark.driver.memory", "10g") .getOrCreate() )` while you create your spark session – Jay Kakadiya Jan 29 '20 at 15:26
  • What is your Spark version? See https://issues.apache.org/jira/browse/SPARK-16980. And what is the storage layer (on-prem hdfs or cloud)? – mazaneicha Jan 29 '20 at 16:25
  • @mazaneicha I updated the question to include those values. @JayKakadiya I tested it with `8g` before posting the question and it still wasn't enough; I got the same result – Felipe Gonzalez Jan 29 '20 at 18:17
  • Okay so... object stores have their own problems. https://stackoverflow.com/questions/39513505/spark-lists-all-leaf-node-even-in-partitioned-data? – mazaneicha Jan 29 '20 at 21:25

0 Answers