I'm getting a ResourceWarning in every unit test I run against Spark, like this:
/opt/conda/lib/python3.9/socket.py:775: ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 37512), raddr=('127.0.0.1', 38975)>
self._sock = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
I tracked it down to DataFrame.toPandas(). Example:
import unittest

from pyspark.sql import SparkSession


class PySparkTestCase(unittest.TestCase):
    def test_convert_to_pandas_df(self):
        spark = SparkSession.builder.master("local[2]").getOrCreate()
        rawData = spark.range(10)
        print("XXX 1")
        pdfData = rawData.toPandas()
        print("XXX 2")
        print(pdfData)


if __name__ == '__main__':
    unittest.main(verbosity=2)
You'll see the two ResourceWarnings just before the XXX 2 output line.
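As an aside, to actually get the allocation traceback the warning mentions, tracemalloc has to be tracing before the socket is created (as far as I can tell); a minimal sketch of that is to put this at the top of the test module, or equivalently run the tests with python -X tracemalloc=25:

import tracemalloc

# Start tracing allocations before the SparkSession (and its sockets) exist,
# so a later ResourceWarning can report where the unclosed socket came from.
tracemalloc.start(25)  # keep up to 25 frames per allocation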
However, if you run the same code outside unittest, you won't get the resource warning!
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[2]").getOrCreate()
rawData = spark.range(10)
print("XXX 1")
pdfData = rawData.toPandas()
print("XXX 2")
print(pdfData)
So, is unittest doing something to cause this resource warning in toPandas()? I appreciate I could hide the resource warning (e.g., see here or here), but I'd rather not generate it in the first place!
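For reference, this is the kind of suppression I mean (a sketch only, not what I want to end up with). As far as I understand, the unittest runner installs its own warning filter when the run starts, so a module-level ignore gets overridden; putting the filter in setUp, or passing warnings='ignore' to unittest.main(), does take effect:

import unittest
import warnings

from pyspark.sql import SparkSession


class PySparkTestCase(unittest.TestCase):
    def setUp(self):
        # Ignore only this category; set here (inside the runner's own
        # warning context) rather than at module level so it isn't overridden.
        warnings.simplefilter("ignore", ResourceWarning)

    def test_convert_to_pandas_df(self):
        spark = SparkSession.builder.master("local[2]").getOrCreate()
        print(spark.range(10).toPandas())


if __name__ == '__main__':
    # Alternatively, tell unittest itself to drop all warnings:
    # unittest.main(verbosity=2, warnings='ignore')
    unittest.main(verbosity=2)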