
I have one big DataFrame, A.

I want to apply a filter to it to make a DataFrame B, and put the rows that do not pass the filter into another DataFrame C.

In summary, it's similar to the following pseudocode:

A.foreach(row => {
  if (isFiltered(row)) addToDF_B(row)
  else addToDF_C(row)
})

B and C will then be written to different tables.

I tried filtering B first and using A.except(B) to make C, but except doesn't work if the schema has a complex type (map or array).

Other than filtering twice, is there any way to do this in a single pass?

Thanks in advance.

zero323
JaycePark
    This question is worth linking, I think: https://stackoverflow.com/questions/32970709/how-do-i-split-an-rdd-into-two-or-more-rdds For RDDs no good native way exists, but with additional libraries or by hacking partitions it becomes possible. – Rick Moritz Jun 26 '17 at 10:36
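For completeness, one single-pass alternative (under the assumption that writing to partitioned output is acceptable for the two target tables) is to tag each row with a flag column and write once, partitioned by that flag. The column name, condition, and output path below are hypothetical, chosen only for illustration:

import org.apache.spark.sql.functions.{col, when}

// Tag each row instead of splitting A into two DataFrames.
val tagged = A.withColumn(
  "matched",
  when(col("col1") < 10, "b").otherwise("c") // stand-in for isFiltered
)

// One scan, one write; each partition directory ("matched=b", "matched=c")
// can then be loaded or registered as its own table.
tagged.write
  .partitionBy("matched")
  .parquet("/tmp/split_output")

This trades the two-table write for a partitioned layout, but it reads A only once and avoids except entirely, so complex column types are not a problem.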

2 Answers


You can use the simple .filter API on DataFrame A:

import spark.implicits._ // needed for toDF

val A = Seq(
  (1, 22),
  (2, 11),
  (10, 3),
  (20, 4)
).toDF("col1", "col2")

A.show(false)

The A DataFrame looks like this:

+----+----+
|col1|col2|
+----+----+
|1   |22  |
|2   |11  |
|10  |3   |
|20  |4   |
+----+----+

Define your filter as

import org.apache.spark.sql.functions.col

def filter = col("col1") < 10

and apply it to get the first DataFrame:

val B = A.filter(filter)
B.show(false)

Output is

+----+----+
|col1|col2|
+----+----+
|1   |22  |
|2   |11  |
+----+----+

Your C DataFrame is the complement of B:

val C = A.filter(!filter)
C.show(false)

Output is

+----+----+
|col1|col2|
+----+----+
|10  |3   |
|20  |4   |
+----+----+
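Note that both B and C are computed lazily, so writing each of them re-reads A. If A is expensive to produce, caching it first avoids the double scan. A minimal sketch (the table names are placeholders):

// Cache A so that materializing both B and C scans the source only once.
A.cache()

val B = A.filter(filter)
val C = A.filter(!filter)

B.write.saveAsTable("table_b") // placeholder table name
C.write.saveAsTable("table_c") // placeholder table name

A.unpersist() // release the cache when done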
Ramesh Maharjan

You can also do it using Spark SQL:

import spark.implicits._ // needed for toDF

val A = Seq(
  (1, 22),
  (2, 11),
  (10, 3),
  (20, 4)
).toDF("col1", "col2")

A.show(false)

A.createOrReplaceTempView("A") // register A so SQL can reference it by name

val B = spark.sql("SELECT * FROM A /* WHERE <your condition for B> */") // spark is SparkSession or SQLContext
val C = spark.sql("SELECT * FROM A /* WHERE <your condition for C> */") // spark is SparkSession or SQLContext
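With the placeholder conditions filled in (assuming, purely for illustration, the same col1 < 10 condition used in the other answer), the queries could look like:

// Register A as a temp view so it is visible to spark.sql.
A.createOrReplaceTempView("A")

val B = spark.sql("SELECT * FROM A WHERE col1 < 10")
val C = spark.sql("SELECT * FROM A WHERE col1 >= 10")

This is equivalent to the two .filter calls; Spark compiles both forms to the same plan, so the choice is a matter of style.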
Rahul Kanodiya