Let's say we have a DataFrame dfSource that is non-trivial (e.g. the result of several joins) and large (e.g. 100k+ rows), and it has a column some_boolean which I want to use to split it, like this:
val dfTrue = dfSource.where(col("some_boolean") === true)
// write dfTrue, e.g. dfTrue.write.parquet("data1")
val dfFalse = dfSource.where(col("some_boolean") === false)
// write dfFalse, e.g. dfFalse.write.parquet("data2")
Now this would result in scanning and filtering the data twice, right? Is there any way to do this more efficiently? I thought of something like:
val (dfTrue, dfFalse) = dfSource.split(col("some_boolean") === true)
// write dfTrue and dfFalse
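
As far as I can tell, DataFrame has no predicate-based split method like the one above, so for reference here is a minimal sketch of two workarounds that use calls that do exist in the Spark API: persisting dfSource so that the second filter reads from the cache instead of recomputing the joins, or doing a single write with partitionBy on the boolean column. The toy dfSource, the local SparkSession, the object name SplitByBoolean and the output path data_split are assumptions made only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object SplitByBoolean {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not part of the question)
    val spark = SparkSession.builder()
      .appName("split-by-boolean")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the expensive, join-heavy dfSource
    val dfSource: DataFrame =
      Seq((1, true), (2, false), (3, true)).toDF("id", "some_boolean")

    // Option A: persist, then filter twice.
    // The first write computes dfSource and fills the cache;
    // the second write reads the cached partitions instead of recomputing.
    dfSource.persist(StorageLevel.MEMORY_AND_DISK)
    dfSource.where(col("some_boolean") === true)
      .write.mode("overwrite").parquet("data1")
    dfSource.where(col("some_boolean") === false)
      .write.mode("overwrite").parquet("data2")
    dfSource.unpersist()

    // Option B: a single write, partitioned by the boolean column.
    // This produces data_split/some_boolean=true/ and
    // data_split/some_boolean=false/ in one pass over dfSource.
    dfSource.write.mode("overwrite")
      .partitionBy("some_boolean")
      .parquet("data_split")

    spark.stop()
  }
}
```

Option A keeps the two separate output paths but costs memory or disk for the cache. Option B avoids the cache, but the directory layout changes and the some_boolean value is encoded in the directory name rather than stored in the Parquet files themselves.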