I have a PySpark DataFrame and I would like to join 3 columns.

id |  column_1   | column_2    | column_3
--------------------------------------------
1  |     12      |   34        |    67
--------------------------------------------
2  |     45      |   78        |    90
--------------------------------------------
3  |     23      |   93        |    56
--------------------------------------------

I want to join the 3 columns column_1, column_2, and column_3 into a single column, adding "-" between their values.

Expected result:

id |  column_1   | column_2    | column_3    |   column_join
-------------------------------------------------------------
1  |     12      |     34      |     67      |   12-34-67
-------------------------------------------------------------
2  |     45      |     78      |     90      |   45-78-90
-------------------------------------------------------------
3  |     23      |     93      |     56      |   23-93-56
-------------------------------------------------------------

How can I do this in PySpark? Thank you.


2 Answers

It's pretty simple:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))

Use concat to concatenate all the columns, and lit to insert the - separator as a literal string between them.

If it doesn't work directly, you can use cast to convert the columns to string first: col("column_1").cast("string")
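One thing to be aware of with concat: it returns NULL if any of its inputs is NULL. A minimal pure-Python sketch of that behavior (the helper name concat_like is mine, purely for illustration — it is not part of PySpark):

```python
def concat_like(*vals):
    # Mimics Spark's concat semantics: if any value is None (NULL),
    # the whole result is None; otherwise values are stringified and joined.
    if any(v is None for v in vals):
        return None
    return "".join(str(v) for v in vals)

# With "-" literals between values, as in the answer above:
print(concat_like(12, "-", 34, "-", 67))    # 12-34-67
print(concat_like(12, "-", None, "-", 67))  # None
```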

UPDATE:

Alternatively, you can use a more dynamic approach with the built-in function concat_ws:

pyspark.sql.functions.concat_ws(sep, *cols)

Concatenates multiple input string columns together into a single string column, using the given separator.

>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]

Code:

from pyspark.sql.functions import col, concat_ws

concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))
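A useful difference from concat: concat_ws skips NULL inputs rather than propagating them. A pure-Python sketch of that semantics (the name concat_ws_like is mine, for illustration only):

```python
def concat_ws_like(sep, *vals):
    # Mimics Spark's concat_ws: None (NULL) values are skipped,
    # the rest are stringified and joined with the separator.
    return sep.join(str(v) for v in vals if v is not None)

print(concat_ws_like("-", 12, 34, 67))    # 12-34-67
print(concat_ws_like("-", 12, None, 67))  # 12-67
```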
Here is a generic/dynamic way of doing this instead of concatenating manually. All we need to do is specify the columns to concatenate.

# Importing requisite functions.
from pyspark.sql.functions import udf

# Creating the DataFrame
df = spark.createDataFrame([(1,12,34,67),(2,45,78,90),(3,23,93,56)],['id','column_1','column_2','column_3'])

Now, specifying the list of columns we want to concatenate, separated by -.

list_of_columns_to_join = ['column_1','column_2','column_3']

Finally, create a UDF. Bear in mind that UDF-based solutions are generally slower than built-in functions, since Spark cannot optimize them.

def concat_cols(*list_cols):
    return '-'.join(str(i) for i in list_cols)

concat_cols = udf(concat_cols)
df = df.withColumn('column_join', concat_cols(*list_of_columns_to_join))
df.show()
+---+--------+--------+--------+-----------+
| id|column_1|column_2|column_3|column_join|
+---+--------+--------+--------+-----------+
|  1|      12|      34|      67|   12-34-67|
|  2|      45|      78|      90|   45-78-90|
|  3|      23|      93|      56|   23-93-56|
+---+--------+--------+--------+-----------+
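Since the UDF body is plain Python, it can be sanity-checked outside Spark before registering it with udf:

```python
def concat_cols(*list_cols):
    # Same logic as the UDF above: stringify each value, join with "-".
    return '-'.join(str(i) for i in list_cols)

# Quick check without a SparkSession, using the first row of the example data:
print(concat_cols(12, 34, 67))  # 12-34-67
```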