
If we create a DataFrame from Python objects (like a dict or a list), then even though the Python data could be collected by the GC, the DataFrame is lazy, so it needs to keep the source of the data. As a result, every such Python object is kept in the interpreter's memory. If those objects are really big, this can crash the Spark job because the Python interpreter can't get enough memory. Am I right?

def get_df():
    # d should be collected by the GC after leaving the function body
    d = [{'a': 1}]
    df = spark.createDataFrame(d)
    return df

df = get_df()
# ...but d is still kept in the interpreter's memory
Sergii V.
  • You're somewhat wrong, but the outcome is pretty much the same. The reference to `d` could be safely removed here; however, the local `Dataset` that is created here will still occupy a proportional amount of driver memory - this is similar, though not identical, to [_Why does SparkContext.parallelize use memory of the driver?_](https://stackoverflow.com/q/46262236/6910411). – zero323 Dec 11 '18 at 12:17
  • @user6910411 So if the source data of the DataFrame can't be reached another way, the solution is to store this DataFrame in a Spark-friendly format (like parquet, JDBC, etc.) and recreate the DataFrame from that new source, so it can be read lazily (the data doesn't need to be stored in the driver)? – Sergii V. Dec 11 '18 at 12:26
  • 1
    Sounds about right, just remember that to make it work, you'll need some for of shared storage. – zero323 Dec 11 '18 at 12:28
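
For what it's worth, a minimal sketch of the workaround discussed in the comments above, assuming an existing SparkSession named `spark`; the parquet path is only illustrative:

def get_df():
    # build the DataFrame from local Python data (which lives in the driver)...
    d = [{'a': 1}]
    spark.createDataFrame(d).write.mode('overwrite').parquet('/shared/storage/my_df')
    # ...then recreate it from shared storage, so it is read lazily by the
    # executors and no local copy has to stay in driver memory
    return spark.read.parquet('/shared/storage/my_df')

df = get_df()

The storage (HDFS, S3, a mounted volume, etc.) has to be reachable by the executors, not just the driver, which is why some form of shared storage is needed.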

0 Answers