
Let's say we have a DataFrame dfSource that is non-trivial (e.g. the result of several joins) and large (e.g. 100k+ rows), and it has a column some_boolean which I want to use to split it, like this:

val dfTrue = dfSource.where(col("some_boolean") === true)
// write dfTrue, e.g. dfTrue.write.parquet("data1")
val dfFalse = dfSource.where(col("some_boolean") === false)
// write dfFalse, e.g. dfFalse.write.parquet("data2")

Now this would result in scanning and filtering the data twice, right? Is there any way to do this more efficiently? I thought of something like

val (dfTrue, dfFalse) = dfSource.split(col("some_boolean") === true)
// write dfTrue and dfFalse
Salim
  • it depends on what you want to do after this step. until the where statement, it is not scanning the df. – Lamanus Mar 24 '20 at 13:52
  • Does this answer your question? [How do I split an RDD into two or more RDDs?](https://stackoverflow.com/questions/32970709/how-do-i-split-an-rdd-into-two-or-more-rdds) – user10938362 Mar 24 '20 at 14:05
  • 2
    you can `dfSource.cache()` your non-trivial dataframe so that it is computed once, at first write. – mazaneicha Mar 24 '20 at 14:13
  • @Lamanus as written in the following comment, I'd like to write those DFs to two different files or tables. – Salim Mar 29 '20 at 14:03
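
As mazaneicha's comment suggests, one way to avoid recomputing the non-trivial plan for both writes is to cache dfSource once, so the joins are evaluated only for the first write and the second write reads from the cache. A minimal Scala sketch, reusing the column name and paths from the question (not a definitive implementation):

import org.apache.spark.sql.functions.col

// Materialize the expensive plan once; both filtered writes reuse it
val cached = dfSource.cache()

cached.where(col("some_boolean") === true).write.parquet("data1")
cached.where(col("some_boolean") === false).write.parquet("data2")

cached.unpersist() // free the cached blocks once both writes are done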

1 Answer


I see that you store the output after splitting. You can use partitionBy when writing, as follows:

# assumes an active SparkSession available as `spark`
dfSource = spark.createDataFrame([
    ['a', True],
    ['b', False],
    ['c', True],
    ['d', True],
    ['e', False],
    ['f', False]
], ["col1", "col2"]).cache()
dfSource.show()

+----+-----+
|col1| col2|
+----+-----+
|   a| true|
|   b|false|
|   c| true|
|   d| true|
|   e|false|
|   f|false|
+----+-----+


dfSource.write.partitionBy("col2").parquet("/tmp/df")

You will see two directories: `/tmp/df/col2=true` and `/tmp/df/col2=false`

Now you can read them as usual

dfTrue = spark.read.parquet("/tmp/df/col2=true")
dfTrue.show()
+----+
|col1|
+----+
|   a|
|   c|
|   d|
+----+

dfFalse = spark.read.parquet("/tmp/df/col2=false")
dfFalse.show()
+----+
|col1|
+----+
|   b|
|   e|
|   f|
+----+
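
For reference, the same partitioned-write idea expressed against the question's Scala DataFrame would look roughly like this (dfSource, some_boolean and an active SparkSession spark are assumed from the question; the /tmp/df path mirrors the answer):

// Single scan of dfSource; Parquet files are laid out per value of some_boolean
dfSource.write.partitionBy("some_boolean").parquet("/tmp/df")

// Each partition directory can then be read back on its own
val dfTrue  = spark.read.parquet("/tmp/df/some_boolean=true")
val dfFalse = spark.read.parquet("/tmp/df/some_boolean=false")

Note that, as in the Python example above, the partition column itself is not stored inside the files, so dfTrue and dfFalse read this way no longer contain some_boolean.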
Majid Alfifi