
I have the following PySpark dataframe:

identification  p1  p2  p3  p4
1               1   0   0   1
2               0   1   1   0
3               0   0   0   1

I want to concatenate all columns from p1 to p4, gathering the 1 and 0 values like this:

identification  p1  p2  p3  p4  joined_column
1               1   0   0   1   1001
2               0   1   1   0   0110
3               0   0   0   1   0001

All columns from p1 to p4 are integers, so I tried to cast them to string before concatenating all columns (except the 'identification' column) like this:


from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df_concat = df.withColumn('joined_column', F.concat([F.col(c).cast(StringType()) for c in df.columns if c != 'identification']))

I get the following error:

TypeError: Invalid argument, not a string or column: 

Is there any solution or another way to concatenate all columns of a PySpark dataframe into one string?

– Abdessamad139

3 Answers


You have to use `concat_ws`:

from pyspark.sql.functions import concat_ws

df = df.withColumn('joined_column', concat_ws('', df.p1, df.p2, df.p3, df.p4))
– s510
  • This code alone will not solve the problem. You need to convert p1 to p4 to int type; after that your code will concat the data. – BigData-Guru Sep 30 '22 at 13:51
  • Works for me!! Can you please test before commenting? :) – s510 Sep 30 '22 at 14:09
  • I tested in Databricks. It's coming out like 1.00.00.01.0 – BigData-Guru Sep 30 '22 at 14:21
  • I think OP has asked for (py)spark and not Databricks. And clearly the values are integers, as they are shown as `1` and **not** `1.0` – s510 Sep 30 '22 at 15:33
  • I am using pyspark code, with Databricks as the interface. For me, when I uploaded the csv file having integer data, it got converted to double. Maybe because of that, your code didn't work. Sorry for the confusion :) – BigData-Guru Sep 30 '22 at 15:39
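
For reference, here is a minimal end-to-end sketch of this approach. The SparkSession setup and sample data are assumptions, recreated from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question; the p-columns are integers.
df = spark.createDataFrame(
    [(1, 1, 0, 0, 1), (2, 0, 1, 1, 0), (3, 0, 0, 0, 1)],
    ['identification', 'p1', 'p2', 'p3', 'p4'],
)

# concat_ws casts each argument to string and joins them with the
# given separator (an empty string here), yielding e.g. '1001'.
df = df.withColumn('joined_column', concat_ws('', df.p1, df.p2, df.p3, df.p4))
df.show()

If the columns were read in as doubles (as in the comment thread above), casting them back to integer first, as the next answer does, restores the expected output.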
0

You can use the code below. If you check your schema, the data in those columns may not be integer; you need to convert p1 to p4 to integer type first:

from pyspark.sql.types import IntegerType

df = df \
    .withColumn('p1', df['p1'].cast(IntegerType())) \
    .withColumn('p2', df['p2'].cast(IntegerType())) \
    .withColumn('p3', df['p3'].cast(IntegerType())) \
    .withColumn('p4', df['p4'].cast(IntegerType()))

from pyspark.sql import functions as sf

df = df.withColumn('joined_column', sf.concat(sf.col('p1'), sf.col('p2'), sf.col('p3'), sf.col('p4')))

display(df)
– BigData-Guru
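
The per-column withColumn calls above can also be collapsed into a single pass. A minimal sketch, under the assumption that every column except 'identification' needs the cast:

from pyspark.sql import functions as sf
from pyspark.sql.types import IntegerType

# Cast every p-column to integer in one select, then concatenate them.
p_cols = [c for c in df.columns if c != 'identification']
df = df.select('identification', *[df[c].cast(IntegerType()).alias(c) for c in p_cols])
df = df.withColumn('joined_column', sf.concat(*[sf.col(c) for c in p_cols]))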
0

I will answer the question based on @samkart's comment: we don't need to convert each column to a string, since the values are converted to strings automatically while concatenating.

from pyspark.sql import functions as F

df_concat = df.withColumn('joined_column', F.concat(*[F.col(c) for c in df.columns if c != 'identification']))

This works regardless of the number and names of the columns.

– Abdessamad139
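
One caveat that applies when choosing between the two functions, stated here as general Spark SQL behavior rather than anything from the answers above: F.concat returns NULL as soon as any input column is NULL, whereas concat_ws simply skips NULL values. A small sketch (assuming the `spark` session from the first sketch, with hypothetical data):

from pyspark.sql import functions as F

# Hypothetical data containing a NULL to show the difference.
df_nulls = spark.createDataFrame(
    [(1, None, 0), (2, 1, 1)],
    ['identification', 'p1', 'p2'],
)
df_nulls.select(
    F.concat(F.col('p1'), F.col('p2')).alias('with_concat'),      # NULL for row 1
    F.concat_ws('', F.col('p1'), F.col('p2')).alias('with_ws'),   # '0' for row 1
).show()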