
Let's say we have a DataFrame dfSource that is non-trivial (e.g. the result of several joins) and large (e.g. 100k+ rows), and it has a column some_boolean which I want to use to split it, like this:

val dfTrue = dfSource.where(col("some_boolean") === true)
// write dfTrue, e.g. dfTrue.write.parquet("data1")
val dfFalse = dfSource.where(col("some_boolean") === false)
// write dfFalse, e.g. dfFalse.write.parquet("data2")

Now this would result in scanning and filtering the data twice, right? Is there any way to do this more efficiently? I thought of something like

val (dfTrue, dfFalse) = dfSource.split(col("some_boolean") === true)
// write dfTrue and dfFalse
Salim
  • it depends on what you want to do after this step. until the where statement, it is not scanning the df. – Lamanus Mar 24 '20 at 13:52
  • Does this answer your question? [How do I split an RDD into two or more RDDs?](https://stackoverflow.com/questions/32970709/how-do-i-split-an-rdd-into-two-or-more-rdds) – user10938362 Mar 24 '20 at 14:05
  • 2
    you can `dfSource.cache()` your non-trivial dataframe so that it is computed once, at first write. – mazaneicha Mar 24 '20 at 14:13
  • @Lamanus as written in the following comment, I'd like to write those DFs to two different files or tables. – Salim Mar 29 '20 at 14:03
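
As mazaneicha's comment suggests, one way to avoid recomputing the non-trivial plan for both writes is to cache dfSource once, so the joins are evaluated only for the first write and the second write reads from the cache. A minimal Scala sketch, reusing the column name and paths from the question (not a definitive implementation):

import org.apache.spark.sql.functions.col

// Materialize the expensive plan once; both filtered writes reuse it
val cached = dfSource.cache()

cached.where(col("some_boolean") === true).write.parquet("data1")
cached.where(col("some_boolean") === false).write.parquet("data2")

cached.unpersist() // free the cached blocks once both writes are done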

1 Answer


I see that you store the output after splitting. You can use partitionBy when writing, as follows:

# assumes an active SparkSession available as `spark`
dfSource = spark.createDataFrame([
    ['a', True],
    ['b', False],
    ['c', True],
    ['d', True],
    ['e', False],
    ['f', False]
], ["col1", "col2"]).cache()
dfSource.show()

+----+-----+
|col1| col2|
+----+-----+
|   a| true|
|   b|false|
|   c| true|
|   d| true|
|   e|false|
|   f|false|
+----+-----+


dfSource.write.partitionBy("col2").parquet("/tmp/df")

You will see two directories: `/tmp/df/col2=true` and `/tmp/df/col2=false`

Now you can read them as usual

dfTrue = spark.read.parquet("/tmp/df/col2=true")
dfTrue.show()
+----+
|col1|
+----+
|   a|
|   c|
|   d|
+----+

dfFalse = spark.read.parquet("/tmp/df/col2=false")
dfFalse.show()
+----+
|col1|
+----+
|   b|
|   e|
|   f|
+----+
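
For reference, the same partitioned-write idea expressed against the question's Scala DataFrame would look roughly like this (dfSource, some_boolean and an active SparkSession spark are assumed from the question; the /tmp/df path mirrors the answer):

// Single scan of dfSource; Parquet files are laid out per value of some_boolean
dfSource.write.partitionBy("some_boolean").parquet("/tmp/df")

// Each partition directory can then be read back on its own
val dfTrue  = spark.read.parquet("/tmp/df/some_boolean=true")
val dfFalse = spark.read.parquet("/tmp/df/some_boolean=false")

Note that, as in the Python example above, the partition column itself is not stored inside the files, so dfTrue and dfFalse read this way no longer contain some_boolean.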
Majid Alfifi