
Is it possible to divide a DataFrame into two parts using a single filter operation? For example,

let's say df has the records below:

UID    Col
 1       a
 2       b
 3       c

If I do

val df1 = df.filter($"UID" <=> 2)

can I save the filtered and non-filtered records in different RDDs in a single operation?

 df1 would hold the records where uid = 2
 df2 would hold the records with uid 1 and 3
user2895589

1 Answer


If you're interested only in saving data you can add an indicator column to the DataFrame:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)

and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6 these are Parquet, text, and JSON):

dfWithInd.write.partitionBy("ind").parquet(...)

It will create two separate directories (ind=false, ind=true) on write.
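To get the two halves back as separate DataFrames, each directory can be read on its own. The following is only a sketch under some assumptions that are not in the original answer: it uses the `SparkSession` entry point (Spark 2.0+; on 1.6 the equivalent would go through `sqlContext`), a local master, and a temporary output directory `outDir` made up for the illustration. It is written script-style, e.g. for pasting into spark-shell.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ with a local master; not part of the original answer.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-on-write")
  .getOrCreate()
import spark.implicits._

// Hypothetical output location: a fresh sub-path that does not exist yet,
// because the default save mode fails if the target already exists.
val outDir = Files.createTempDirectory("split-demo").resolve("out").toString

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")

// Single write pass: rows land in outDir/ind=true or outDir/ind=false.
df.withColumn("ind", $"uid" <=> 2)
  .write
  .partitionBy("ind")
  .parquet(outDir)

// Each partition directory can then be loaded as its own DataFrame.
// Reading a partition directory directly drops the "ind" column.
val matched = spark.read.parquet(s"$outDir/ind=true")
val unmatched = spark.read.parquet(s"$outDir/ind=false")

// Collect the uids so the split is easy to inspect.
val matchedUids = matched.select("uid").as[Int].collect().sorted
val unmatchedUids = unmatched.select("uid").as[Int].collect().sorted
```

Note that this splits the data in one write, but getting the DataFrames back still costs one read per directory.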

In general, though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See "How to split a RDD into two or more RDDs?"
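If both halves are needed in memory rather than on disk, the usual fallback is two filters over a cached DataFrame, one per half. This is a minimal sketch of that common pattern, not something the answer above spells out; the `SparkSession` setup (Spark 2.0+, local master) and the name `isMatch` are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ with a local master; not part of the original answer.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-two-filters")
  .getOrCreate()
import spark.implicits._

// Cache the input so the second filter does not recompute it from the source.
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col").cache()

// <=> is null-safe equality and never evaluates to null,
// so the predicate and its negation cover every row exactly once.
val isMatch = $"uid" <=> 2

val df1 = df.filter(isMatch)   // rows where uid = 2
val df2 = df.filter(!isMatch)  // all remaining rows
```

These are still two transformations, so the data is scanned twice; caching only keeps the second scan from going back to the original source.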

zero323