
I have a DataFrame in Spark, using Scala, with a column that I need to split.

scala> test.show
+-------------+
|columnToSplit|
+-------------+
|        a.b.c|
|        d.e.f|
+-------------+
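For reference, the input DataFrame can be reproduced with something like the following (assuming a SparkSession named spark with its implicits in scope):

import spark.implicits._

val test = Seq("a.b.c", "d.e.f").toDF("columnToSplit")
test.show()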

I need this column split out to look like this:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   b|   c|
|   d|   e|   f|
+----+----+----+

I'm using Spark 2.0.0.

Thanks

Matt Maurer

7 Answers


Try:

import sparkObject.spark.implicits._
import org.apache.spark.sql.functions.split

df.withColumn("_tmp", split($"columnToSplit", "\\.")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3")
)

The important point to note here is that sparkObject is the object holding the SparkSession (spark) you have already initialized. Because the import depends on that live value, it has to be placed inline within the code, after the SparkSession is created, rather than at the top of the file before the class definition.
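For context, a minimal sketch of that placement follows; the object, app, and method names are only placeholders, and it imports spark.implicits._ directly rather than through an enclosing sparkObject:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

object SplitExample {                      // placeholder enclosing object
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                  // placeholder master, for local testing
      .appName("split-example")
      .getOrCreate()

    // the implicits import depends on the spark value, so it sits inside the
    // method body, after the SparkSession is created, not at the top of the file
    import spark.implicits._

    val df = Seq("a.b.c", "d.e.f").toDF("columnToSplit")

    df.withColumn("_tmp", split($"columnToSplit", "\\."))
      .select(
        $"_tmp".getItem(0).as("col1"),
        $"_tmp".getItem(1).as("col2"),
        $"_tmp".getItem(2).as("col3"))
      .show()

    spark.stop()
  }
}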

Manindar

To do this programmatically, you can create a sequence of expressions with (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")) (assuming you need 3 columns as the result) and then apply it to select with the : _* syntax:

df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
    (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   a|   b|   c|
|   d|   e|   f|
+----+----+----+

To keep all columns:

df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
    col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+-------------+---------+----+----+----+
|columnToSplit|     temp|col0|col1|col2|
+-------------+---------+----+----+----+
|        a.b.c|[a, b, c]|   a|   b|   c|
|        d.e.f|[d, e, f]|   d|   e|   f|
+-------------+---------+----+----+----+

If you are using pyspark, use a generator expression (or list comprehension) in place of the Scala map:

from pyspark.sql.functions import col, split

df = spark.createDataFrame([['a.b.c'], ['d.e.f']], ['columnToSplit'])

(df.withColumn('temp', split('columnToSplit', '\\.'))
   .select(*(col('temp').getItem(i).alias(f'col{i}') for i in range(3)))
   .show())
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   a|   b|   c|
|   d|   e|   f|
+----+----+----+
Psidom
  • Can we split to an arbitrary number of columns? Why does 3 have to be hardcoded? – Ivan Balashov Oct 08 '18 at 11:13
  • I have a similar scenario, but the number of "."-separated values is unknown. How can we split the rows dynamically? – Jeevan Jan 10 '19 at 17:09
  • There are multiple nice answers here because they each break things down a little further, but if you have a lot of columns or a lot of dataframes to do this with (for example reading many topics from Kafka), then this answer is the way to go. – NYCeyes Jan 18 '19 at 00:42
  • How does the first answer (the not-all-columns version) translate to pyspark/python? – NYCeyes Jan 18 '19 at 00:46
  • I created another answer to show how to implement this approach without hardcoding the number of columns. – Powers Mar 17 '19 at 10:36

A solution which avoids the select part. This is helpful when you just want to append the new columns:

import org.apache.spark.sql.functions.{col, split}

case class Message(others: String, text: String)

val r1 = Message("foo1", "a.b.c")
val r2 = Message("foo2", "d.e.f")

val records = Seq(r1, r2)
val df = spark.createDataFrame(records)

df.withColumn("col1", split(col("text"), "\\.").getItem(0))
  .withColumn("col2", split(col("text"), "\\.").getItem(1))
  .withColumn("col3", split(col("text"), "\\.").getItem(2))
  .show(false)

+------+-----+----+----+----+
|others|text |col1|col2|col3|
+------+-----+----+----+----+
|foo1  |a.b.c|a   |b   |c   |
|foo2  |d.e.f|d   |e   |f   |
+------+-----+----+----+----+

Update: I highly recommend using Psidom's implementation to avoid splitting three times.

Sascha Vetter

This appends the new columns to the original DataFrame without using select, and splits only once by using a temporary column:

import spark.implicits._
import org.apache.spark.sql.functions.split

df.withColumn("_tmp", split($"columnToSplit", "\\."))
  .withColumn("col1", $"_tmp".getItem(0))
  .withColumn("col2", $"_tmp".getItem(1))
  .withColumn("col3", $"_tmp".getItem(2))
  .drop("_tmp")
soaptree

This expands on Psidom's answer and shows how to do the split dynamically, without hardcoding the number of columns. This answer runs a query to calculate the number of columns.

import spark.implicits._
import org.apache.spark.sql.functions.{col, max, size, split}

val df = Seq(
  "a.b.c",
  "d.e.f"
).toDF("my_str")
.withColumn("letters", split(col("my_str"), "\\."))

val numCols = df
  .withColumn("letters_size", size($"letters"))
  .agg(max($"letters_size"))
  .head()
  .getInt(0)

df
  .select(
    (0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
  )
  .show()
Powers

We can write this using for with yield in Scala:

If you need more columns, just add them to the desiredColumn sequence and adjust accordingly. :)

import spark.implicits._
import org.apache.spark.sql.functions.{col, split}

val aDF = Seq("Deepak.Singh.Delhi").toDF("name")
val desiredColumn = Seq("name", "Lname", "City")
val colsize = desiredColumn.size

val columList = for (i <- 0 until colsize)
  yield split(col("name"), "\\.").getItem(i).alias(desiredColumn(i))

aDF.select(columList: _*).show(false)

Output:

+------+-----+-----+
|name  |Lname|City |
+------+-----+-----+
|Deepak|Singh|Delhi|
+------+-----+-----+

If you don't need the name column, drop it and just use withColumn, as sketched below.
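For example, a minimal sketch of that withColumn variant, reusing aDF from above; the temporary "_parts" column and the "FirstName" label are only illustrative names:

// split once into a temporary array column, add one column per piece,
// then drop both the original column and the temporary one
aDF
  .withColumn("_parts", split(col("name"), "\\."))
  .withColumn("FirstName", col("_parts").getItem(0))
  .withColumn("Lname", col("_parts").getItem(1))
  .withColumn("City", col("_parts").getItem(2))
  .drop("name", "_parts")
  .show(false)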

Deepak

Example: without using the select statement.

Let's assume we have a DataFrame with a set of columns, and we want to split the column named name.

import spark.implicits._

val columns = Seq("name","age","address")

val data = Seq(("Amit.Mehta", 25, "1 Main st, Newark, NJ, 92537"),
             ("Rituraj.Mehta", 28,"3456 Walnut st, Newark, NJ, 94732"))

val dfFromData = spark.createDataFrame(data).toDF(columns: _*)
dfFromData.printSchema()

// split the name column on "." and build a new tuple per row
val newDF = dfFromData.map(f => {
  val nameSplit = f.getAs[String](0).split("\\.").map(_.trim)
  (nameSplit(0), nameSplit(1), f.getAs[Int](1), f.getAs[String](2))
})

val finalDF = newDF.toDF("First Name","Last Name", "Age","Address")

finalDF.printSchema()

finalDF.show(false)

Output:

+----------+---------+---+---------------------------------+
|First Name|Last Name|Age|Address                          |
+----------+---------+---+---------------------------------+
|Amit      |Mehta    |25 |1 Main st, Newark, NJ, 92537     |
|Rituraj   |Mehta    |28 |3456 Walnut st, Newark, NJ, 94732|
+----------+---------+---+---------------------------------+

venus