0

I'm a beginner with Spark. I have Avro records, and I'm creating a Dataset from those records:

Dataset<Row> ds = spark.read().format("com.databricks.spark.avro")
    .option("avroSchema", schema.toString())
    .load("./*.avro");

One of my columns contains values like:

+------------------------+
|col1                    |
+------------------------+
|VCE_B_WSI_20180914_573  |
|WCE_C_RTI_20181223_324  |
+------------------------+

I want to split this column into multiple columns and then group by these new columns, like below:

+----+----+----+
|col1|col2|col3|
+----+----+----+
| VCE|   B| WSI|
| WCE|   C| RTI|
+----+----+----+

I would really appreciate any tips on how to go about this. Should I convert the Dataset to an RDD and apply these transformations there? I'm not sure whether I can add new columns to an RDD.

user3679686

2 Answers

0

You can do this by calling the withColumn function on the DataFrame, using a regular-expression function on the column to extract each specific part. Since you need three new columns, call the function three times, once per group. If you don't need the original column, call the drop function at the end.
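A minimal sketch of the regex this approach relies on (the pattern and group numbers below are illustrative, not from the answer; in Spark the same pattern would be passed to the built-in regexp_extract from org.apache.spark.sql.functions):

```scala
// Captures the three underscore-separated prefixes of values like
// "VCE_B_WSI_20180914_573". In Spark the equivalent calls would be:
//   import org.apache.spark.sql.functions.{col, regexp_extract}
//   ds.withColumn("col2", regexp_extract(col("col1"), pattern, 2))
//     .withColumn("col3", regexp_extract(col("col1"), pattern, 3))
//     .withColumn("col1", regexp_extract(col("col1"), pattern, 1))
val pattern = "^([^_]+)_([^_]+)_([^_]+)_.*"
val m = pattern.r.findFirstMatchIn("VCE_B_WSI_20180914_573").get
println(m.group(1)) // VCE
println(m.group(2)) // B
println(m.group(3)) // WSI
```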

Ramdev Sharma
0

Try the following:

import org.apache.spark.sql.functions.col
import spark.implicits._ // provides the Encoder for the map below

// Mapping each row to Array[String] yields a Dataset with a single
// column named "value"; index into it to build the new columns.
ds.map(r => r.getString(0).split('_'))
  .withColumn("col1", col("value")(0))
  .withColumn("col2", col("value")(1))
  .withColumn("col3", col("value")(2))
  .drop(col("value"))
  .show()
Sankumarsingh
Tomasz Krol