Spark - CSV data split with Scala

Question

test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3

I want to reshape this data as follows (as a Dataset or an RDD):

whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB

I loaded the data with the read method:

val df = spark.read
            .option("header", true)
            .option("charset", "euc-kr")
            .csv(csvFilePath)

I can load each pair of columns, (name, key1) and (name, key2), as a separate dataset and combine them with union, but I would like to do this within a single Spark session. Any ideas?

These attempts do not work:

val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })

val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))

Since key1 and key2 are not in a single column, I think explode is not the right answer. — J.Done, Nov 15 '16 at 02:13
You can convert key1 and key2 into a tuple by applying a map function. — Shankar, Nov 15 '16 at 02:16
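
The map suggestion in the comment above could be sketched roughly as follows, using flatMap on a typed Dataset so that each input row yields two output rows. This is only an illustration, not something posted in the thread: the object name FlatMapSplit, the helper splitWithFlatMap, the app name, and the local master setting are all hypothetical, and the labels KEYA/KEYB are taken from whatIwant.csv.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FlatMapSplit {
  // Hypothetical helper: turns (name, key1, key2) rows into two (name, key, newkeyname) rows each.
  def splitWithFlatMap(df: DataFrame): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._

    df.select($"name", $"key1".cast("string"), $"key2".cast("string"))
      .as[(String, String, String)] // tuple columns are matched by position
      .flatMap { case (name, k1, k2) =>
        Seq((name, k1, "KEYA"), (name, k2, "KEYB"))
      }
      .toDF("name", "key", "newkeyname")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flatmap-split") // hypothetical app name
      .master("local[*]")       // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Same shape as test.csv; values are strings, as when a CSV is read without schema inference.
    val df = Seq(("A", "1", "2"), ("B", "1", "3"), ("C", "4", "3")).toDF("name", "key1", "key2")

    splitWithFlatMap(df).show()
    spark.stop()
  }
}
```

Everything here runs in one SparkSession and reads the input once, so no second load or union is needed.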

evan.oman · Accepted Answer · 2016-11-15T05:24:17.630

This can be accomplished with explode and a udf:

scala> var df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]

scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
|   A|   1|   2|
|   B|   1|   3|
|   C|   4|   3|
+----+----+----+

scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))

scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]

scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]

scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]

scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
|   A|  1|        Key1|
|   A|  2|        Key2|
|   B|  1|        Key1|
|   B|  3|        Key2|
|   C|  4|        Key1|
|   C|  3|        Key2|
+----+---+------------+
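
As a possible variation on the approach above, the same reshaping can probably be done without a UDF by building an array of structs from built-in functions and exploding it. The following is a minimal sketch under stated assumptions, not part of the original answer: the object name ExplodeSplit, the helper splitWithExplode, the temporary column name kv, the app name, and the local master setting are hypothetical, and the labels KEYA/KEYB come from the question's desired output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object ExplodeSplit {
  // Hypothetical helper: one output row per (name, keyN) pair, labelled KEYA / KEYB.
  def splitWithExplode(df: DataFrame): DataFrame =
    df.select(
        col("name"),
        // Both structs have the same field names and types, so they fit in one array.
        explode(array(
          struct(col("key1").cast("string").as("key"), lit("KEYA").as("newkeyname")),
          struct(col("key2").cast("string").as("key"), lit("KEYB").as("newkeyname"))
        )).as("kv"))
      // Pull the struct fields back out as top-level columns.
      .select(col("name"), col("kv.key"), col("kv.newkeyname"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explode-split") // hypothetical app name
      .master("local[*]")       // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")

    splitWithExplode(df).show()
    spark.stop()
  }
}
```

Sticking to built-in functions keeps the whole expression visible to Spark's optimizer, and naming the struct fields up front avoids the intermediate _1/_2 columns.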

Profit! It's a bit different from my original problem, but I can make it work with this. Thanks a lot :) — J.Done, Nov 15 '16 at 05:34
