
I have a very large Spark DataFrame with many columns, and I want to make an informed judgement about whether to keep each of them in my pipeline, based in part on how big they are. By "how big," I mean the size in bytes in RAM when the DataFrame is cached, which I expect to be a decent proxy for the computational cost of processing the data. Some columns are simple types (e.g. doubles, integers), but others are complex types (e.g. arrays and maps of variable length).

An approach I have tried is to cache the DataFrame without and then with the column in question, check out the Storage tab in the Spark UI, and take the difference. But this is an annoying and slow exercise for a DataFrame with a lot of columns.
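
(I believe the numbers shown in the Storage tab can also be read programmatically, e.g. via Spark's developer API SparkContext.getRDDStorageInfo as in the rough sketch below, assuming a SparkSession object called spark, but that still only measures whole cached DataFrames, not individual columns.)

# rough sketch: read the cached-RDD sizes that back the Storage tab,
# using Spark's developer API SparkContext.getRDDStorageInfo via py4j
# (assumes a SparkSession object called spark)
for info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    print(info.name(), "-", info.memSize(), "bytes in memory,",
          info.diskSize(), "bytes on disk")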

I typically use PySpark, so a PySpark answer would be preferable, but Scala would be fine as well.

abeboparebop

1 Answer


I found a solution that builds on this related answer: https://stackoverflow.com/a/49529028.

Assuming I'm working with a DataFrame called df and a SparkSession object called spark:

import org.apache.spark.sql.{functions => F}

// force the full dataframe into memory (could specify persistence
// mechanism here to ensure that it's really being cached in RAM)
df.cache()
df.count()

// calculate size of full dataframe
val catalystPlan = df.queryExecution.logical
val dfSizeBytes = spark.sessionState.executePlan(catalystPlan).optimizedPlan.stats.sizeInBytes

for (col <- df.columns) {
    println("Working on " + col)

    // select all columns except this one:
    val subDf = df.select(df.columns.filter(_ != col).map(F.col): _*)
    
    // force subDf into RAM
    subDf.cache()
    subDf.count()

    // calculate size of subDf
    val catalystPlan = subDf.queryExecution.logical
    val subDfSizeBytes = spark.sessionState.executePlan(catalystPlan).optimizedPlan.stats.sizeInBytes

    // size of this column as a fraction of full dataframe
    val colSizeFrac = (dfSizeBytes - subDfSizeBytes).toDouble / dfSizeBytes.toDouble
    println("Column space fraction is " + colSizeFrac * 100.0 + "%")
    subDf.unpersist()
}

Some confirmations that this approach gives sensible results:

  1. The reported column sizes add up to 100%.
  2. Simple type columns like integers or doubles take up the expected 4 bytes or 8 bytes per row.
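
Since the question asks for PySpark, here is a rough port of the same loop. It reaches the same Catalyst plan statistics through py4j internals (df._jdf and spark._jsparkSession), so treat it as a sketch: on newer Spark releases SessionState.executePlan takes an extra command-execution-mode argument, and the single-argument call below may need adjusting.

from pyspark.sql import functions as F

# force the full dataframe into memory
df.cache()
df.count()

def plan_size_bytes(dataframe):
    # re-plan the dataframe's logical plan through the session state so the
    # statistics reflect the cached in-memory relation, as in the Scala code
    catalyst_plan = dataframe._jdf.queryExecution().logical()
    return (spark._jsparkSession.sessionState()
            .executePlan(catalyst_plan)
            .optimizedPlan()
            .stats()
            .sizeInBytes()
            .longValue())

# size of the full dataframe
df_size_bytes = plan_size_bytes(df)

for col in df.columns:
    print("Working on " + col)

    # select all columns except this one:
    sub_df = df.select([F.col(c) for c in df.columns if c != col])

    # force sub_df into RAM
    sub_df.cache()
    sub_df.count()

    # calculate size of sub_df
    sub_df_size_bytes = plan_size_bytes(sub_df)

    # size of this column as a fraction of the full dataframe
    col_size_frac = (df_size_bytes - sub_df_size_bytes) / df_size_bytes
    print("Column space fraction is " + str(col_size_frac * 100.0) + "%")
    sub_df.unpersist()

The .longValue() call converts the Scala BigInt returned by sizeInBytes into a plain Python int, so the fraction can be computed directly in Python.
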
abeboparebop