
I am trying to persist two very large data frames before performing a join to work around the "java.util.concurrent.TimeoutException: Futures timed out..." issue (ref: Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?).

persist() alone works, but when I try to specify a storage level, I receive NameErrors.

I've tried the following:

df.persist(pyspark.StorageLevel.MEMORY_ONLY) 
NameError: name 'MEMORY_ONLY' is not defined

df.persist(StorageLevel.MEMORY_ONLY) 
NameError: name 'StorageLevel' is not defined

import org.apache.spark.storage.StorageLevel 
ImportError: No module named org.apache.spark.storage.StorageLevel

Any help would be greatly appreciated.


3 Answers


You will have to import the appropriate package:

from pyspark import StorageLevel
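
With that import in place, a storage level can be passed to persist() directly, for example:

df.persist(StorageLevel.MEMORY_ONLY)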

The following works for me:

from pyspark.storagelevel import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
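
Note that for DataFrames, calling persist() with no argument already defaults to MEMORY_AND_DISK (the exact default varies slightly between Spark versions), which is why the bare persist() in the question worked; an explicit level is only needed when you want different caching behavior.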

Alternatively, import the top-level pyspark package:

import pyspark
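
With the package imported, the fully qualified reference from the question resolves as well:

df.persist(pyspark.StorageLevel.MEMORY_ONLY)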