
Using Scala, how can I split a DataFrame into multiple DataFrames (an array or some other collection), one per distinct value of a column? For example, I want to split the following DataFrame:

ID  Rate    State
1   24  AL
2   35  MN
3   46  FL
4   34  AL
5   78  MN
6   99  FL

to:

data set 1

ID  Rate    State
1   24  AL  
4   34  AL

data set 2

ID  Rate    State
2   35  MN
5   78  MN

data set 3

ID  Rate    State
3   46  FL
6   99  FL
– user1735076

3 Answers


You can collect the unique state values and simply map over the resulting array:

val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byStateArray = states.map(state => df.where($"State" <=> state))

or build a map:

val byStateMap = states
    .map(state => (state -> df.where($"State" <=> state)))
    .toMap

The same thing in Python:

from itertools import chain
from pyspark.sql.functions import col, lit

states = chain(*df.select("state").distinct().collect())

# eqNullSafe requires PySpark 2.3 or later.
# In 2.2 and before, col("state") == state
# should give the same outcome, ignoring NULLs.
# If NULLs are important, use
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state:
  df.where(col("state").eqNullSafe(state)) for state in states}

The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you're only looking for a way to split the output, see also How do I split an RDD into two or more RDDs?

In particular, you can write the Dataset partitioned by the column of interest:

val path: String = ???
df.write.partitionBy("State").parquet(path)

and read back if needed:

// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)

// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")

Depending on the size of the data, the number of levels to split by, and the storage and persistence level of the input, it might be faster or slower than multiple filters.
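
If you stay with the multiple-filter approach, caching the input first usually avoids rescanning the source for every state. A minimal sketch, assuming the same df and State column as above:

import org.apache.spark.storage.StorageLevel
import spark.implicits._

// Cache the input once so each per-state filter reuses it
// instead of rescanning the underlying source.
df.persist(StorageLevel.MEMORY_AND_DISK)

val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byState = states.map(state => state -> df.where($"State" <=> state)).toMap

// Call df.unpersist() once you are finished with the per-state DataFrames.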

– zero323
  • Maybe kind of a late question, but when I try the Python code in Spark 2.2.0 I always get a "Column is not callable" error. I tried several approaches but still get the same error. Any workarounds for this? – inneb Oct 08 '17 at 11:16
  • You need to import `col` with `from pyspark.sql.functions import col` – Luis Mar 28 '18 at 16:04

It is very simple (on Spark 2.x) if you register the DataFrame as a temporary view.

df1.createOrReplaceTempView("df1")

Now you can run the queries:

val df2 = spark.sql("select * from df1 where state = 'FL'")
val df3 = spark.sql("select * from df1 where state = 'MN'")
val df4 = spark.sql("select * from df1 where state = 'AL'")

Now you have df2, df3, and df4. If you want them as local collections, you can use:

df2.collect()
df3.collect()

or even the map/filter functions (sketched below). Please refer to https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes

Ash
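
For instance, a minimal sketch of the same three splits using the DataFrame filter API instead of SQL, assuming the df1 DataFrame from above:

import org.apache.spark.sql.functions.col

// Same result as the SQL queries above, expressed with the DataFrame API.
val df2 = df1.filter(col("State") === "FL")
val df3 = df1.filter(col("State") === "MN")
val df4 = df1.filter(col("State") === "AL")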

  • Is there a possibility to loop SQL queries in Spark? Collecting all distinct values first and then replacing the "where state = 'FL'" with "where state = 'i'" or something like this? – inneb Oct 08 '17 at 11:11
  • It will add some overhead, but you can still handle it using Spark DataFrames and Scala code (see the sketch below) – ashK Nov 10 '17 at 13:38
  • I used the same approach to split a DF into 5 sub-DFs for doing left joins, but the resultant DF is a view and not an independent DF on its own, which is messing with the left joins. Can I split into independent DFs? – Sandeep540 Oct 14 '19 at 16:30
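
Regarding the loop question in the comments, a minimal sketch, assuming the df1 temp view from this answer and a Spark 2.x spark session:

// Collect the distinct states once, then build one DataFrame per state in a loop.
val states = spark.sql("select distinct state from df1")
  .collect()
  .map(_.getString(0))

val dfByState = states.map { s =>
  s -> spark.sql(s"select * from df1 where state = '$s'")
}.toMap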

You can use:

val stateDF = df.select("state").distinct()             // distinct states as a DataFrame
val states = stateDF.rdd.map(x => x(0)).collect.toList  // distinct states as a local list

df.createOrReplaceTempView("table1")                    // register df so it can be queried as "table1"

for (state <- states)  // loop over each state
{
    val finalDF = sqlContext.sql("select * from table1 where state = '" + state + "'")
    // use or store finalDF here; it goes out of scope at the end of each iteration
}