Concatenate columns in Apache Spark DataFrame

Question

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

zero323 · Answer 1 · 2016-02-22T20:02:19.563

With raw SQL you can use CONCAT:

In Python

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

In Scala

import sqlContext.implicits._

val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")

Since Spark 1.5.0 you can use the concat function with the DataFrame API:

In Python:

from pyspark.sql.functions import concat, col, lit

df.select(concat(col("k"), lit(" "), col("v")))

In Scala:

import org.apache.spark.sql.functions.{concat, lit}

df.select(concat($"k", lit(" "), $"v"))

There is also the concat_ws function, which takes a string separator as its first argument.

Use `concat_ws()` to handle **null values** (cf. the answers below). – Benji, Apr 13 '23 at 15:53

score 65 · Answer 2 · edited Sep 06 '17 at 15:51

Here's how you can do custom naming

import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

gives,

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

create new column by concatenating:

df = df.withColumn('joined_column', 
                    sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+


`lit` creates a column of `_` – muon Aug 08 '17 at 18:30

Ignacio Alorre · Answer 3 · 2018-12-03T07:23:11.733

score 41

One option to concatenate string columns in Spark Scala is using concat.

It is necessary to check for null values, because if any one of the columns is null, the result will be null even if the other columns contain data.

Using concat and withColumn:

import org.apache.spark.sql.functions.{col, concat, lit, when}

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Using concat and select:

val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")

With both approaches you will have a NEW_COLUMN whose value is the concatenation of the columns COL1 and COL2 from your original df.

answered Mar 29 '18 at 07:03


I tried your method in pyspark but it did not work, warning "col should be Column". – Samson Nov 18 '19 at 15:41
@Samson sorry, I only checked for the Scala API – Ignacio Alorre Nov 19 '19 at 14:11

@IgnacioAlorre If you are using `concat_ws` instead of `concat`, you can avoid checking for NULL. – Aswath K Mar 03 '20 at 16:02
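The difference these comments point at can be modelled without a cluster. The sketch below is plain Python, not Spark code: `concat_model` and `concat_ws_model` are hypothetical names, and `None` stands in for SQL NULL. It mirrors the behavior described in this thread, where `concat` returns null as soon as one input is null while `concat_ws` skips null inputs.

```python
from typing import Optional


def concat_model(*values: Optional[str]) -> Optional[str]:
    """Model of concat: any null input makes the whole result null."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_model(sep: Optional[str], *values: Optional[str]) -> Optional[str]:
    """Model of concat_ws: null inputs are skipped; a null separator gives null."""
    if sep is None:
        return None
    return sep.join(v for v in values if v is not None)


# Compare both functions on rows with and without missing values.
rows = [("foo", "bar"), ("foo", None)]
for left, right in rows:
    print(repr(concat_model(left, right)), repr(concat_ws_model("_", left, right)))
```

If you need a visible placeholder for missing values rather than having them skipped, the explicit `when(...).otherwise(...)` or `nvl` forms from the answer are still the ones to use.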

score 27 · Answer 4 · edited Mar 04 '22 at 04:47

concat(*cols)

v1.5 and higher

Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.

Eg: new_df = df.select(concat(df.a, df.b, df.c))

concat_ws(sep, *cols)

v1.5 and higher

Similar to concat but uses the specified separator.

Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))

map_concat(*cols)

v2.4 and higher

Used to concat maps, returns the union of all the given maps.

Eg: new_df = df.select(map_concat("map1", "map2"))

Using concat operator (||):

v2.3 and higher

Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")

Reference: Spark sql doc

Danish Shrestha · Answer 5 · 2015-07-20T22:27:48.057

If you want to do it using DF, you could use a udf to add a new column based on existing columns.

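import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // enables the $"col" syntax used below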
case class MyDf(col1: String, col2: String)

//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
    Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))

//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )

//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
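For PySpark readers, the body of such a UDF is just an ordinary function. A minimal sketch, assuming string inputs where either side may be missing; `concat_with_space` is a hypothetical name, and the wrapping step is shown only as a comment because it needs a running Spark session. Note that the built-in `concat` and `concat_ws` are generally preferable to a UDF when they cover the case.

```python
from typing import Optional


def concat_with_space(first: Optional[str], second: Optional[str]) -> Optional[str]:
    """Join two values with a space, returning None if either one is missing."""
    if first is None or second is None:
        return None
    return f"{first} {second}"


# With PySpark installed, the function could be wrapped roughly like this:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   get_concatenated = udf(concat_with_space, StringType())
#   df.withColumn("newColName", get_concatenated(df["col1"], df["col2"]))

print(concat_with_space("A", "B"))
```

Returning `None` for a missing input is a design choice made for this sketch; a plain `first + " " + second` in Python would raise a `TypeError` on `None` instead.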

score 13 · Answer 6 · edited Nov 02 '18 at 11:36


Since Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.

For example:

val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
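Like `concat`, the `||` operator yields null when any operand is null, so columns that may be null are often wrapped in `coalesce`. The sketch below only builds the SQL expression string in plain Python; `pipe_concat_expr` is a hypothetical helper, and it assumes string columns and a replacement value that needs no escaping.

```python
from typing import Sequence


def _quote(name: str) -> str:
    """Backtick-quote a column name, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def pipe_concat_expr(columns: Sequence[str], null_as: str = "") -> str:
    """Join columns with ||, using coalesce so one null does not null the result."""
    if not columns:
        raise ValueError("at least one column is required")
    if "'" in null_as or "\\" in null_as:
        # Assumption: keep the sketch simple by refusing values that need escaping.
        raise ValueError("null_as must not contain quotes or backslashes")
    parts = [f"coalesce({_quote(c)}, '{null_as}')" for c in columns]
    return " || ".join(parts)


print(pipe_concat_expr(["_c1", "_c2"]))
```

The resulting string can be dropped into a query, for example `spark.sql(f"select {pipe_concat_expr(['_c1', '_c2'])} as concat_column from some_table")`, where `some_table` is a placeholder name.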

answered Apr 19 '18 at 14:09 · Devas · edited Nov 02 '18 at 11:36 · mrsrinivas

score 12 · Answer 7 · edited Sep 06 '17 at 15:51

Here is another way of doing this for pyspark:

#import concat and lit functions from pyspark.sql.functions 
from pyspark.sql.functions import concat, lit

#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])

#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))

#Show the new data frame
personDF.show()

----------RESULT-------------------------

84
+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+

score 9 · Answer 8 · edited Aug 17 '17 at 17:48


Here is a suggestion for when you don't know the number or names of the columns in the DataFrame.

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
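The same idea can be expressed as a SQL expression string, which `selectExpr` accepts. A minimal sketch in Python, assuming the column names come from something like `df.columns`; `concat_ws_expr` and `quote_identifier` are hypothetical helpers, and the code only builds the string, so nothing here touches Spark.

```python
from typing import Iterable


def quote_identifier(name: str) -> str:
    """Backtick-quote a column name, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def concat_ws_expr(columns: Iterable[str], sep: str = ",", alias: str = "joined") -> str:
    """Build a concat_ws(...) SQL expression over an arbitrary list of columns."""
    if "'" in sep or "\\" in sep:
        # Assumption: keep the sketch simple by refusing separators that need escaping.
        raise ValueError("separator must not contain quotes or backslashes")
    cols = [quote_identifier(c) for c in columns]
    if not cols:
        raise ValueError("at least one column is required")
    return f"concat_ws('{sep}', {', '.join(cols)}) AS {quote_identifier(alias)}"


print(concat_ws_expr(["k", "v", "East Africa"]))
```

With a DataFrame at hand, this would be used as `df.selectExpr(concat_ws_expr(df.columns))`. Quoting every name with backticks keeps column names containing spaces, such as `East Africa`, usable.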

answered Aug 17 '17 at 17:46 · wones0120 · edited Aug 17 '17 at 17:48 · Paul Roub

score 3 · Answer 9 · edited Mar 10 '20 at 04:41


Is there Java syntax corresponding to the process below?

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))

answered Mar 10 '20 at 04:13 · Roopesh MB · edited Mar 10 '20 at 04:41 · Muhammad Dyas Yaskur

score 2 · Answer 10 · answered Mar 12 '18 at 20:24


In Spark 2.3.0, you may do:

spark.sql( """ select '1' || column_a from table_a """)

answered Mar 12 '18 at 20:24 · Charlie 木匠

score 1 · Answer 11 · answered Apr 19 '18 at 18:19

In Java you can do this to concatenate multiple columns. The sample code below shows one scenario and how the function is used in it.

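import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());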
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
                        .withColumn("concatenatedCol",
                                concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));


class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;

    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession.builder().config(sparkConf)
                    .getOrCreate();
        }
        return instance;
    }
}

The above code concatenates col1, col2 and col3, separated by "_", to create a column named "concatenatedCol".

vijayraj34 · Answer 12 · 2021-06-12T17:42:44.027

score 1

In my case, I wanted a pipe ('|') delimited row.

from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()

This worked like a hot knife through butter.

answered Dec 05 '20 at 17:54

score 1 · Answer 13 · answered Nov 09 '21 at 06:17


Use the concat function like this:

Dataset<Row> DF2 = DF1
            .withColumn("NEW_COLUMN", concat(col("ADDR1"), col("ADDR2"), col("ADDR3")));

answered Nov 09 '21 at 06:17 · Davoud Malekahmadi

score 0 · Answer 14 · edited Mar 17 '17 at 05:54


Another way to do it in pySpark using sqlContext...

from pyspark.sql.functions import concat

# Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])

# Now we can concatenate columns and assign the new column a name 
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))

answered Jan 10 '17 at 17:43 · Gur · edited Mar 17 '17 at 05:54 · mrsrinivas

score 0 · Answer 15 · answered Aug 11 '19 at 17:33

Indeed, there are built-in abstractions that let you do the concatenation without implementing a custom function. Since you mentioned Spark SQL, I am guessing you want to pass it as a declarative command through spark.sql(). If so, you can do it in a straightforward manner with a SQL command like: SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;

Also, from Spark 2.3.0, you can use commands along the lines of: SELECT col1 || col2 AS concat_column_name FROM <table_name>;

Here, <delimiter> is your preferred delimiter (it can be an empty space as well) and <table_name> is the temporary or permanent table you are trying to read from.

score 0 · Answer 16 · edited Oct 29 '20 at 21:43


We can simply use selectExpr as well.

df1.selectExpr("*","upper(_2||_3) as new")

answered Jun 07 '20 at 15:19 · Deepak Saxena · edited Oct 29 '20 at 21:43 · lennon310

score 0 · Answer 17 · answered Aug 22 '23 at 20:42


Spark SQL provides two built-in functions: concat and concat_ws. Use concat to merge multiple strings into a single string, and concat_ws to merge multiple strings into a single string with a delimiter/separator.

answered Aug 22 '23 at 20:42 · Lakshmakumar

score -2 · Answer 18 · edited Nov 16 '20 at 21:31

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Note: For this code to work, you need to add the parentheses "()" after "isNotNull". The correct call is "isNotNull()".

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))
