0

I'm a beginner with Spark. I have Avro records, and I'm creating a Dataset from those records:

Dataset<Row> ds = spark.read().format("com.databricks.spark.avro")
    .option("avroSchema", schema.toString())
    .load("./*.avro");

One of my columns contains values like:

+------------------------+
|col1                    |
+------------------------+
|VCE_B_WSI_20180914_573  |
|WCE_C_RTI_20181223_324  |
+------------------------+

I want to split this column into multiple columns and then group by these new columns, like below:

+----+----+----+
|col1|col2|col3|
+----+----+----+
| VCE|   B| WSI|
| WCE|   C| RTI|
+----+----+----+

I would really appreciate any tips on how to go about this. Should I convert the Dataset to an RDD and apply these transformations there? I'm not sure whether I can add new columns to an RDD.

user3679686

2 Answers

0

You can do this by calling the withColumn function on the DataFrame, using a regular-expression function on the column to extract each specific part. Since you need three new columns, call the function three times, once per group. If you don't need the original column, call the drop function at the end.
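A minimal sketch of the regex this approach relies on (the pattern and group numbers below are illustrative, not from the answer; in Spark the same pattern would be passed to the built-in regexp_extract from org.apache.spark.sql.functions):

```scala
// Captures the three underscore-separated prefixes of values like
// "VCE_B_WSI_20180914_573". In Spark the equivalent calls would be:
//   import org.apache.spark.sql.functions.{col, regexp_extract}
//   ds.withColumn("col2", regexp_extract(col("col1"), pattern, 2))
//     .withColumn("col3", regexp_extract(col("col1"), pattern, 3))
//     .withColumn("col1", regexp_extract(col("col1"), pattern, 1))
val pattern = "^([^_]+)_([^_]+)_([^_]+)_.*"
val m = pattern.r.findFirstMatchIn("VCE_B_WSI_20180914_573").get
println(m.group(1)) // VCE
println(m.group(2)) // B
println(m.group(3)) // WSI
```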

Ramdev Sharma
0

Try the following:

import org.apache.spark.sql.functions.col
import spark.implicits._ // provides the Encoder for the map below

// Mapping each row to Array[String] yields a Dataset with a single
// column named "value"; index into it to build the new columns.
ds.map(r => r.getString(0).split('_'))
  .withColumn("col1", col("value")(0))
  .withColumn("col2", col("value")(1))
  .withColumn("col3", col("value")(2))
  .drop(col("value"))
  .show()
Sankumarsingh
Tomasz Krol