
I am trying to persist two very large data frames before performing a join to work around the "java.util.concurrent.TimeoutException: Futures timed out..." issue (ref: Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?).

persist() alone works, but when I try to specify a storage level, I receive NameErrors.

I've tried the following:

df.persist(pyspark.StorageLevel.MEMORY_ONLY) 
NameError: name 'MEMORY_ONLY' is not defined

df.persist(StorageLevel.MEMORY_ONLY) 
NameError: name 'StorageLevel' is not defined

import org.apache.spark.storage.StorageLevel 
ImportError: No module named org.apache.spark.storage.StorageLevel

Any help would be greatly appreciated.


3 Answers


You will have to import the appropriate package:

from pyspark import StorageLevel
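
With that import in place, a storage level can be passed to persist() directly, for example:

df.persist(StorageLevel.MEMORY_ONLY)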

The following works for me:

from pyspark.storagelevel import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
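
Note that for DataFrames, calling persist() with no argument already defaults to MEMORY_AND_DISK (the exact default varies slightly between Spark versions), which is why the bare persist() in the question worked; an explicit level is only needed when you want different caching behavior.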

Alternatively, import the top-level pyspark package:

import pyspark
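
With the package imported, the fully qualified reference from the question resolves as well:

df.persist(pyspark.StorageLevel.MEMORY_ONLY)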