
I have a very big Spark 2.3 dataframe like this:

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AB |    2 |    1 |
|      AA |    2 |    3 |
|      AC |    1 |    2 |
|      AA |    3 |    2 |
|      AC |    5 |    3 |
-------------------------

I need to "split" this dataframe by values in col_key column and save each splitted part in separate csv file, so I have to get smaller dataframes like

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AA |    1 |    2 |
|      AA |    2 |    3 |
|      AA |    3 |    2 |
-------------------------

and

-------------------------
| col_key | col1 | col2 |
-------------------------
|      AC |    1 |    2 |
|      AC |    5 |    3 |
-------------------------

and so on. Every resulting dataframe needs to be saved as a different CSV file.

The number of keys is not big (20-30), but the total amount of data is (~200 million records).

I have a solution where every part of the data is selected in a loop and then saved to a file:

import org.apache.spark.sql.functions.lit
import spark.implicits._  // for the $"..." column syntax and the String encoder used by .map

// Collect the distinct keys to the driver; there are only 20-30 of them
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect().toList

// Filter the dataframe once per key and save each subset to its own file
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})

It works correctly, but this solution has a bad drawback: every selection of data by every key causes a full pass through the whole dataframe, and it takes too much time. I think there must be a faster solution where we pass through the dataframe only once, routing every record to the "right" result dataframe (or directly to a separate file) as we go. But I don't know how to do it :) Maybe someone has ideas about it?
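One partial mitigation (a sketch, assuming the dataframe fits in cluster memory) would be caching it before the loop, so each per-key filter reads from memory instead of re-scanning the source; it still runs one job per key, though:

import org.apache.spark.storage.StorageLevel

// Keep the dataframe in memory (spilling to disk if needed) across the per-key filters
df.persist(StorageLevel.MEMORY_AND_DISK)
keysList.foreach(k => SaveDataByKey(df.where($"col_key" === lit(k)), path_to_save))
df.unpersist()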

Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing the data (so using RDDs is not desirable, if possible).

Ihor Konovalenko
  • What does `SaveDataByKey` do? Is what you want to do simply to save the dataframe into different folders partitioned on the `col_key` column? – Shaido May 09 '19 at 08:58
  • I need to save the data (extracted by different keys) to different csv files. `SaveDataByKey` does exactly this. – Ihor Konovalenko May 09 '19 at 09:02
  • Possible duplicate of [How do I split an RDD into two or more RDDs?](https://stackoverflow.com/questions/32970709/how-do-i-split-an-rdd-into-two-or-more-rdds) – user10938362 May 09 '19 at 09:06
  • I don't think it is a duplicate, because I want to use Spark's DataFrame API only, if that is possible. – Ihor Konovalenko May 09 '19 at 09:08

1 Answer


You need to partition by the column when writing the CSV. Each partition value is then written out separately:

yourDF
  .write
  .partitionBy("col_key")  // one output subdirectory per distinct col_key value
  .csv("/path/to/save")

Why don't you try this?
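For reference, this produces one subdirectory per distinct key value, roughly like the following (the part-file names will differ, and a key directory may contain several part files):

/path/to/save/col_key=AA/part-00000-....csv
/path/to/save/col_key=AB/part-00000-....csv
/path/to/save/col_key=AC/part-00000-....csv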

Muhunthan
  • Thanks, it can be acceptable for my task. But there are two notes: 1) the column "col_key" also needs to be present in the result data; 2) the result needs to be a single csv file for each key. – Ihor Konovalenko May 09 '19 at 12:36
  • Regarding note 1): we can create a duplicate column for `col_key` so that the duplicate also ends up in the result data. But what about automatically combining all csv files (created for some key) into one csv? – Ihor Konovalenko May 09 '19 at 13:14
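A sketch that addresses both notes, building on the duplicate-column idea from the comment above (the name col_key_copy is made up for illustration). Duplicating the key keeps one copy in the file contents, since partitionBy drops the partition column from the files, and repartitioning on the key first puts all rows of a key into a single task, so each key directory ends up with exactly one part file:

import org.apache.spark.sql.functions.col

yourDF
  .withColumn("col_key_copy", col("col_key"))  // this copy stays in the CSV contents
  .repartition(col("col_key"))                 // all rows of a key in one partition => one file per key directory
  .write
  .partitionBy("col_key")
  .csv("/path/to/save")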