
I have the following PySpark dataframe:

identification  p1  p2  p3  p4
1               1   0   0   1
2               0   1   1   0
3               0   0   0   1

I want to concatenate all columns from p1 to p4, gathering the 1 and 0 values like this:

identification  p1  p2  p3  p4  joined_column
1               1   0   0   1   1001
2               0   1   1   0   0110
3               0   0   0   1   0001

All columns from p1 to p4 are integers, so I tried to cast them to string before concatenating all columns (except the 'identification' column) like this:


from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df_concat = df.withColumn('joined_column', F.concat([F.col(c).cast(StringType()) for c in df.columns if c != 'identification']))

I get the following error:

TypeError: Invalid argument, not a string or column: 

Is there any solution or another way to concatenate all columns of a PySpark dataframe into one string?

– Abdessamad139

3 Answers


You have to use `concat_ws`:

from pyspark.sql.functions import concat_ws

df = df.withColumn('joined_column', concat_ws('', df.p1, df.p2, df.p3, df.p4))
– s510
  • This code alone will not solve the problem. You need to convert p1 to p4 to int type; after that your code will concat the data. – BigData-Guru Sep 30 '22 at 13:51
  • Works for me!! Can you please test before commenting? :) – s510 Sep 30 '22 at 14:09
  • I tested in Databricks. It's coming out like 1.00.00.01.0 – BigData-Guru Sep 30 '22 at 14:21
  • I think OP has asked for (py)spark and not Databricks. And clearly the values are integers, as they are shown as `1` and **not** `1.0` – s510 Sep 30 '22 at 15:33
  • I am using pyspark code, with Databricks as the interface. For me, when I uploaded the csv file having integer data, it got converted to double. Maybe because of that, your code didn't work. Sorry for the confusion :) – BigData-Guru Sep 30 '22 at 15:39
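
For reference, here is a minimal end-to-end sketch of this approach. The SparkSession setup and sample data are assumptions, recreated from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question; the p-columns are integers.
df = spark.createDataFrame(
    [(1, 1, 0, 0, 1), (2, 0, 1, 1, 0), (3, 0, 0, 0, 1)],
    ['identification', 'p1', 'p2', 'p3', 'p4'],
)

# concat_ws casts each argument to string and joins them with the
# given separator (an empty string here), yielding e.g. '1001'.
df = df.withColumn('joined_column', concat_ws('', df.p1, df.p2, df.p3, df.p4))
df.show()

If the columns were read in as doubles (as in the comment thread above), casting them back to integer first, as the next answer does, restores the expected output.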
0

You can use the code below. If you check your schema, the data in those columns may not be integer; you need to convert p1 to p4 to integer type first:

from pyspark.sql.types import IntegerType

df = df \
    .withColumn('p1', df['p1'].cast(IntegerType())) \
    .withColumn('p2', df['p2'].cast(IntegerType())) \
    .withColumn('p3', df['p3'].cast(IntegerType())) \
    .withColumn('p4', df['p4'].cast(IntegerType()))

from pyspark.sql import functions as sf

df = df.withColumn('joined_column', sf.concat(sf.col('p1'), sf.col('p2'), sf.col('p3'), sf.col('p4')))

display(df)
– BigData-Guru
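
The per-column withColumn calls above can also be collapsed into a single pass. A minimal sketch, under the assumption that every column except 'identification' needs the cast:

from pyspark.sql import functions as sf
from pyspark.sql.types import IntegerType

# Cast every p-column to integer in one select, then concatenate them.
p_cols = [c for c in df.columns if c != 'identification']
df = df.select('identification', *[df[c].cast(IntegerType()).alias(c) for c in p_cols])
df = df.withColumn('joined_column', sf.concat(*[sf.col(c) for c in p_cols]))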
0

I will answer the question based on @samkart's comment: we don't need to convert each column to a string, since the values are converted to strings automatically while concatenating.

from pyspark.sql import functions as F

df_concat = df.withColumn('joined_column', F.concat(*[F.col(c) for c in df.columns if c != 'identification']))

This works regardless of the number and names of the columns.

– Abdessamad139
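
One caveat that applies when choosing between the two functions, stated here as general Spark SQL behavior rather than anything from the answers above: F.concat returns NULL as soon as any input column is NULL, whereas concat_ws simply skips NULL values. A small sketch (assuming the `spark` session from the first sketch, with hypothetical data):

from pyspark.sql import functions as F

# Hypothetical data containing a NULL to show the difference.
df_nulls = spark.createDataFrame(
    [(1, None, 0), (2, 1, 1)],
    ['identification', 'p1', 'p2'],
)
df_nulls.select(
    F.concat(F.col('p1'), F.col('p2')).alias('with_concat'),      # NULL for row 1
    F.concat_ws('', F.col('p1'), F.col('p2')).alias('with_ws'),   # '0' for row 1
).show()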