
I have one big DataFrame, A.

I want to apply a filter to it to make a DataFrame B, and put the rows that do not pass the filter into another DataFrame C.

In summary, it's similar to the following pseudocode:

A.foreach(row => {
  if (isFiltered(row)) addToDF_B(row)
  else addToDF_C(row)
})

B and C will then be written to different tables.

I tried filtering B first and using A.except(B) to make C, but except doesn't work if the schema has a complex type (map or array).

Other than filtering twice, is there any way to do this in a single pass?

Thanks in advance.

zero323
JaycePark
    This question is worth linking, I think: https://stackoverflow.com/questions/32970709/how-do-i-split-an-rdd-into-two-or-more-rdds For RDDs no good native way exists, but with additional libraries or by hacking partitions it becomes possible. – Rick Moritz Jun 26 '17 at 10:36
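For completeness, one single-pass alternative (under the assumption that writing to partitioned output is acceptable for the two target tables) is to tag each row with a flag column and write once, partitioned by that flag. The column name, condition, and output path below are hypothetical, chosen only for illustration:

import org.apache.spark.sql.functions.{col, when}

// Tag each row instead of splitting A into two DataFrames.
val tagged = A.withColumn(
  "matched",
  when(col("col1") < 10, "b").otherwise("c") // stand-in for isFiltered
)

// One scan, one write; each partition directory ("matched=b", "matched=c")
// can then be loaded or registered as its own table.
tagged.write
  .partitionBy("matched")
  .parquet("/tmp/split_output")

This trades the two-table write for a partitioned layout, but it reads A only once and avoids except entirely, so complex column types are not a problem.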

2 Answers


You can use the simple .filter API on DataFrame A:

import spark.implicits._ // needed for toDF

val A = Seq(
  (1, 22),
  (2, 11),
  (10, 3),
  (20, 4)
).toDF("col1", "col2")

A.show(false)

The A DataFrame looks like this:

+----+----+
|col1|col2|
+----+----+
|1   |22  |
|2   |11  |
|10  |3   |
|20  |4   |
+----+----+

Define your filter as

import org.apache.spark.sql.functions.col

def filter = col("col1") < 10

and apply it to get the first DataFrame:

val B = A.filter(filter)
B.show(false)

Output is

+----+----+
|col1|col2|
+----+----+
|1   |22  |
|2   |11  |
+----+----+

Your C DataFrame is the complement of B:

val C = A.filter(!filter)
C.show(false)

Output is

+----+----+
|col1|col2|
+----+----+
|10  |3   |
|20  |4   |
+----+----+
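Note that both B and C are computed lazily, so writing each of them re-reads A. If A is expensive to produce, caching it first avoids the double scan. A minimal sketch (the table names are placeholders):

// Cache A so that materializing both B and C scans the source only once.
A.cache()

val B = A.filter(filter)
val C = A.filter(!filter)

B.write.saveAsTable("table_b") // placeholder table name
C.write.saveAsTable("table_c") // placeholder table name

A.unpersist() // release the cache when done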
Ramesh Maharjan

You can also do it using Spark SQL:

import spark.implicits._ // needed for toDF

val A = Seq(
  (1, 22),
  (2, 11),
  (10, 3),
  (20, 4)
).toDF("col1", "col2")

A.show(false)

A.createOrReplaceTempView("A") // register A so SQL can reference it by name

val B = spark.sql("SELECT * FROM A /* WHERE <your condition for B> */") // spark is SparkSession or SQLContext
val C = spark.sql("SELECT * FROM A /* WHERE <your condition for C> */") // spark is SparkSession or SQLContext
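With the placeholder conditions filled in (assuming, purely for illustration, the same col1 < 10 condition used in the other answer), the queries could look like:

// Register A as a temp view so it is visible to spark.sql.
A.createOrReplaceTempView("A")

val B = spark.sql("SELECT * FROM A WHERE col1 < 10")
val C = spark.sql("SELECT * FROM A WHERE col1 >= 10")

This is equivalent to the two .filter calls; Spark compiles both forms to the same plan, so the choice is a matter of style.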
Rahul Kanodiya