
I have a pyspark dataframe that has fields: "id", "fields_0_type" , "fields_0_price", "fields_1_type", "fields_1_price"

+----+-------------+--------------+-------------+--------------+
|id  |fields_0_type|fields_0_price|fields_1_type|fields_1_price|
+----+-------------+--------------+-------------+--------------+
|1234|Return       |45            |New          |50            |
+----+-------------+--------------+-------------+--------------+

How can I combine the values of these columns into two columns called "type" and "price", with the values separated by ","? The ideal dataframe looks like this:

+----+----------+-----+
|id  |type      |price|
+----+----------+-----+
|1234|Return,New|45,50|
+----+----------+-----+

Note that the data I am providing here is a sample. In reality I have more than just "type" and "price" columns that will need to be combined.

Update:

Thanks, it works. But is there any way I can get rid of the extra ","s? They are caused by blank values in the columns. Is there a way to just skip the columns with blank values? What it shows now:

+-------------------------------------------------------------------+
|type                                                               |
+-------------------------------------------------------------------+
|New,New,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,        |
|New,New,Sale,Sale,New,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,|
+-------------------------------------------------------------------+

How I want it:

+---------------------+
|type                 |
+---------------------+
|New,New,New          |
|New,New,Sale,Sale,New|
+---------------------+
– LLL
    replace the blanks with `lit(None)` something like `df = df.select(*[when(col(c) == "", lit(None)).otherwise(col(c)) for c in cols])` before using `concat_ws` OR do a `regexp_replace` to remove trailing commas. – pault Aug 19 '20 at 19:29
  • what does "cols" refer to here? – LLL Aug 19 '20 at 20:20
  • `cols` should be `df.columns` - that was a typo – pault Aug 19 '20 at 23:10

1 Answer


Put all the columns into an array, then use the concat_ws function.

Example:

df.show()
#+----+-------------+-------------+-------------+
#|  id|fields_0_type|fields_1_type|fields_2_type|
#+----+-------------+-------------+-------------+
#|1234|            a|            b|            c|
#+----+-------------+-------------+-------------+

from pyspark.sql.functions import array, concat_ws

# combine every column except the grouping key
columns = df.columns
columns.remove('id')

df.withColumn("type", concat_ws(",", array(*columns))).drop(*columns).show()
#+----+-----+
#|  id| type|
#+----+-----+
#|1234|a,b,c|
#+----+-----+

UPDATE:

df.show()
#+----+-------------+--------------+-------------+--------------+
#|  id|fields_0_type|fields_0_price|fields_1_type|fields_1_price|
#+----+-------------+--------------+-------------+--------------+
#|1234|            a|            45|            b|            50|
#+----+-------------+--------------+-------------+--------------+

type_cols=[f for f in df.columns if 'type' in f]
price_cols=[f for f in df.columns if 'price' in f]

df.withColumn("type",concat_ws(",",array(*type_cols))).withColumn("price",concat_ws(",",array(*price_cols))).\
drop(*type_cols,*price_cols).\
show()
#+----+----+-----+
#|  id|type|price|
#+----+----+-----+
#|1234| a,b|45,50|
#+----+----+-----+
– notNull