
Here's what my dataframe looks like:

p3.show(false)

CLASS_NAME          ID    CREATED_BY
/SC/ABC/123/abc     123    david
/SC/DEF/456/ghi     456    hannah
... more rows to follow

What I'd like to do is split the CLASS_NAME column, concatenate its first two string parts to form a new column CLIENT_ID, and append that column to the dataframe. The desired output is below:

CLASS_NAME          ID    CREATED_BY  CLIENT_ID
/SC/ABC/123/abc     123    david      /SC/ABC
/SC/DEF/456/ghi     456    hannah     /SC/DEF
... more rows to follow

So far, I was able to follow this answer and split them using the command below:

import org.apache.spark.sql.functions.split
import spark.implicits._

// index 0 of the split array is the empty string before the leading "/", so the segments start at index 1
val p4 = p3.withColumn("CLIENT_ID", split($"CLASS_NAME", "\\/")).select(
  $"CLIENT_ID".getItem(1).as("col1"),
  $"CLIENT_ID".getItem(2).as("col2"),
  $"CLIENT_ID".getItem(3).as("col3")
)
p4.show(false)

col1 col2 col3
SC   ABC  123
SC   DEF  456
... more rows to follow

But I haven't figured out how to 1. concatenate the two strings; 2. append this new column to the original dataframe.

Any ideas would be greatly appreciated!


1 Answer

You can use concat to combine two of the three parts:

import org.apache.spark.sql.functions.concat
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.functions.split

val p4 = p3.withColumn("CLIENT_ID", split($"CLASS_NAME", "\\/"))
  .withColumn("CLIENT_ID", concat(lit("/"), $"CLIENT_ID".getItem(1), lit("/"), $"CLIENT_ID".getItem(2)))

Another option would be to use regexp_extract:

import org.apache.spark.sql.functions.regexp_extract

val p4 = p3.withColumn("CLIENT_ID", regexp_extract($"CLASS_NAME", "(/[A-Z]+/[A-Z]+)/", 1))

Result in both cases:

+---------------+---+----------+---------+
|     CLASS_NAME| ID|CREATED_BY|CLIENT_ID|
+---------------+---+----------+---------+
|/SC/ABC/123/abc|123|     david|  /SC/ABC|
|/SC/DEF/456/ghi|456|    hannah|  /SC/DEF|
+---------------+---+----------+---------+
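
Note that the regexp_extract pattern above assumes the first two path segments consist only of uppercase letters. If that is not guaranteed (an assumption on my part, not stated in the question), a more permissive pattern still captures the first two segments whatever characters they contain:

import org.apache.spark.sql.functions.regexp_extract

// capture everything up to the third "/" regardless of the characters in the segments
val p4 = p3.withColumn("CLIENT_ID", regexp_extract($"CLASS_NAME", "^(/[^/]+/[^/]+)/", 1))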