I have a very large Spark DataFrame with a number of columns, and I want to make an informed judgement about whether to keep each of them in my pipeline, based in part on how big they are. By "how big," I mean the size in bytes in RAM when the DataFrame is cached, which I expect to be a decent proxy for the computational cost of processing the data. Some columns are simple types (e.g. doubles, integers) but others are complex types (e.g. arrays and maps of variable length).
One approach I have tried is to cache the DataFrame first without and then with the column in question, check the Storage tab in the Spark UI, and take the difference between the two reported sizes (roughly as sketched below). But repeating this for every column is an annoying and slow exercise for a DataFrame with a lot of columns.
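For concreteness, here is a minimal sketch of that manual exercise in PySpark; the DataFrame and the column name `col_of_interest` are just placeholders standing in for my real data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for my real DataFrame; "col_of_interest" is the column
# whose memory footprint I want to estimate.
df = spark.createDataFrame(
    [(1, 2.0, [1, 2, 3]), (2, 3.5, [4, 5])],
    ["id", "score", "col_of_interest"],
)

# Cache and materialize WITHOUT the column, then note the
# "Size in Memory" value on the Storage tab of the Spark UI.
without_col = df.drop("col_of_interest").cache()
without_col.count()

# Cache and materialize WITH the column, and note the size again.
with_col = df.cache()
with_col.count()

# The difference between the two sizes, read off the UI by hand,
# is my estimate for the column. That manual read-and-subtract step
# is the part that doesn't scale to many columns.
```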
I typically use PySpark, so a PySpark answer would be preferable, but Scala would be fine as well.