I'm having the following problem that I haven't been able to solve.
I have a very large Hive-partitioned table (several TB, around 14,000 partitions) stored in ORC format that I want to access through the PySpark SQL API. The problem is that when I run a query, every partition gets consulted, even though spark.sql.hive.metastorePartitionPruning is set to true.
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("""
SELECT col
FROM table
WHERE partition_1 = 1
AND partition_2 = 2
LIMIT 1
""")
This gets killed by the JVM with an out-of-memory error:
[Stage 0:===================================> (7072 + 152) / 10000]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 9755"...
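For reference, this is how the pruning flag is enabled on the session (a minimal sketch; I'm assuming it is passed via the builder's .config(), but the same setting could equally come from spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    # Assumption: the flag is set here; it could also be set in
    # spark-defaults.conf with the same effect.
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .enableHiveSupport()
    .getOrCreate()
)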
This is not a partition pruning problem, because the exception is raised while the table is being read, before the query is even executed:
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)
spark.read.table("table") # This causes the same error
Also, this isn't a problem when reading a smaller table, because Spark can list all of its partitions without any trouble.
Is there a way to use the Hive integration to query only the partition I need, without Spark having to scan through every possible partition?
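For context, the alternative of skipping the metastore and reading the partition's files by path would look something like the sketch below, but then I lose the Hive integration entirely (the bucket name and directory layout are illustrative, assuming the standard partition_1=1/partition_2=2 layout):

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("TestHiveIntegration")
    .enableHiveSupport()
    .getOrCreate()
)

# Read one partition's ORC files directly by path; the bucket and layout
# below are illustrative, not the real paths.
df = spark.read.orc("s3://some-bucket/table/partition_1=1/partition_2=2/")
df.select("col").limit(1).show()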
Update:
- Versions: Hive 2.3.6, Spark 2.4.4.
- The table is stored on S3.