
Is Spark's lazy evaluation actually executing something for the following simple example, which just points to a partition of a Hive table and retrieves its columns, with nothing heavy involved:

>>> spark.sql('select * from default.test_table where day="2021-01-01"').columns
[Stage 0:===============================>                   (1547 + 164) / 2477]#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 28049"...
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o61.sql

I don't see why merely pointing to a Hive table should take much memory in PySpark (version 2.4.3). Adding memory to the driver and executors (driver-memory, executor-memory) only makes the query hang forever without producing any useful message. Is there a way to prevent PySpark from executing anything when just defining a DataFrame?
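For reference, what I would expect to be a metadata-only way to get the column names (a sketch, assuming default.test_table is registered in the Hive metastore) is to ask the catalog or resolve the table directly instead of building the query:

# Both calls below should only need table metadata, not a scan of the partition data.
[c.name for c in spark.catalog.listColumns('test_table', 'default')]  # catalog API
spark.table('default.test_table').columns                             # resolve the table, read its schema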

zhh210

1 Answer


You can put a limit on the query to avoid memory errors:

spark.sql('select * from default.test_table where day="2021-01-01" limit 1').columns
Shadowtrooper
  • Adding a limit doesn't work. Retrieving the columns shouldn't depend on the data size anyway. The issue is that PySpark somehow keeps executing something even though all I need is the column names. – zhh210 Feb 15 '21 at 23:24